Re: [HACKERS] SQL MERGE is quite distinct from UPSERT
On 07/20/2014 12:55 PM, Peter Geoghegan wrote:
> There is a *really* big demand for UPSERT from users, not MERGE,
> although MERGE is certainly useful too.

The inability to efficiently say "add this unique-keyed row, or if a row with the same key already exists, replace it atomically" is a fundamental defect in SQL itself. Vendors shouldn't need to come up with their own versions because the standard should really cover this - much like LIMIT and OFFSET.

It's very high in the most frequently asked questions on Stack Overflow, right up there with questions about pg_hba.conf, connection issues on OS X, etc.

I'd be very keen to see atomic upsert in Pg. Please Cc me on any patches / discussion, I'll be an eager tester.

-- 
Craig Ringer                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
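[Editorial note: the statement shape being asked for here would look roughly like the following. The syntax and table are hypothetical at the time of this thread - PostgreSQL later shipped exactly this as INSERT ... ON CONFLICT in 9.5:]

```sql
-- Hypothetical illustration of the requested "atomic upsert"; this syntax
-- did not exist when this thread was written (it matches what PostgreSQL
-- 9.5 eventually shipped). The counters table is invented.
INSERT INTO counters (key, value)
VALUES ('hits', 1)
ON CONFLICT (key) DO UPDATE
    SET value = counters.value + EXCLUDED.value;
```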
Re: [HACKERS] SQL MERGE is quite distinct from UPSERT
On 07/21/2014 01:40 AM, Martijn van Oosterhout wrote:
> FWIW, I agree. MERGE is hard enough as it is, but trying to
> guarantee some kind of atomicity makes it nigh on impossible.
> Indeed, after reading what you wrote I think it may well be
> impossible to make it atomic *and* make it perform in the general
> case.
>
> So, +1 UPSERT.

I totally agree. Particularly because MERGE-like behaviour is already possible with wCTEs and/or table locking. It's not beautiful, but neither is MERGE.

-- 
Craig Ringer                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services
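[Editorial note: the wCTE approach mentioned above is commonly sketched like this. Table and column names are invented, and note that this pattern is *not* race-free under concurrency without additional locking - which is exactly why the thread calls it "not beautiful":]

```sql
-- Sketch of the writable-CTE "merge-like" pattern: try the UPDATE first,
-- and INSERT only if it updated nothing. Concurrent sessions can still
-- race between the two steps without a table lock.
WITH upd AS (
    UPDATE counters
       SET value = value + 1
     WHERE key = 'hits'
 RETURNING key
)
INSERT INTO counters (key, value)
SELECT 'hits', 1
WHERE NOT EXISTS (SELECT 1 FROM upd);
```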
Re: [HACKERS] 9.5: UPDATE/DELETE .. ORDER BY .. LIMIT ..
On 2014-06-24 11:08, Heikki Linnakangas wrote:
> IMHO this needs to work with inheritance if we are to accept it. It
> would be a rather strange limitation for no apparent reason, other than
> that we didn't bother to implement it. It doesn't seem very difficult in
> theory to add the table OID to the plan as a junk column, and use that
> in the ModifyTable node to know which table a row came from.

So I've been trying to look at this with Rukh, and it doesn't seem at all as easy as you make it out to be. Consider the following example:

=# create table foo(a int);
=# create table foo_c1(a int, b int) inherits (foo);
=# create table foo_c2(a int, c text) inherits (foo);
=# update foo set a = a+1;

Currently, the way that works is that the ModifyTable node knows of three plans, all with different target lists, targeting each of the tables separately. To make $SUBJECT work with inheritance, we would somehow have to get all three different types of tuples through an Append node to the ModifyTable (with ctid and tableoid as junk columns).

I can see how the approach Tom and Andres talk about upthread could work, but that's a helluva lot of work just so we can check the inheritance check box for this feature, even though it's not clear anyone would actually want to use this on inheritance sets. The other approach I can think of would be to create some kind of a Frankenstein tuple which would satisfy the needs of all tables in the set, but that sounds really awful. Or we could just forget that we ever had inheritance and assume that partitioning is the only interesting use case. But it's not clear to me that we could assume the output to be the same in every case even then.

If someone has a clear idea on how this could be implemented in a reasonable time, we'd be happy to hear about it.

.marko
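[Editorial note: the tableoid-as-junk-column idea Heikki refers to can already be seen at the SQL level. Using the tables from the example above, tableoid identifies which child table each row physically lives in:]

```sql
-- tableoid tells us the source table of each row when reading through
-- the inheritance parent; the junk-column proposal would carry the same
-- information internally up to the ModifyTable node.
SELECT tableoid::regclass AS source_table, a FROM foo;
```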
[HACKERS] Draft release notes are up for 9.3.5
I've committed first-draft release notes into HEAD (only, so far). As in the past couple of release cycles, I've only created a full new documentation section for 9.3.5; tomorrow I will copy final text into the back-branch release-note files, trimming away whatever does not apply to a given branch.

With respect to the ongoing discussion about pg_upgrade's handling of datminmxid/relminmxid, I've documented that patch on the assumption that we don't really want to recommend any manual attempts to accomplish the same result. See the thread on -bugs for discussion if you care.

Comments, corrections?

			regards, tom lane
Re: [HACKERS] pg_stat_statements cluttered with "DEALLOCATE dbdpg_p*"
On Sun, Jul 20, 2014 at 10:56 AM, Tom Lane wrote:
> However, this is certainly a behavioral change. Perhaps squeeze it
> into 9.4, but not the back branches?

+1

-- 
Peter Geoghegan
Re: [HACKERS] pg_stat_statements cluttered with "DEALLOCATE dbdpg_p*"
>>> If you do not like my normalization hack (I do not like it much
>>> either :-), I have suggested to add "&& !IsA(parsetree,
>>> DeallocateStmt)" to the condition above, which would ignore
>>> DEALLOCATE as PREPARE and EXECUTE are currently and rightfully
>>> ignored.
>>
>> Well, EXECUTE isn't actually ignored, but tracked via the execution
>> time. But that doesn't diminish your point with PREPARE. If we do
>> something we should go for the && !IsA(parsetree, DeallocateStmt), not
>> the normalization. The latter is pretty darn bogus.
>
> Agreed. I think basically the reasoning here is "since we don't track
> PREPARE or EXECUTE, we shouldn't track DEALLOCATE either".

Yes. It is not just because it is nicely symmetric, it is also annoying to have hundreds of useless DEALLOCATE stats in the table.

> However, this is certainly a behavioral change. Perhaps squeeze it
> into 9.4,

That would be nice, and the one-liner looks safe enough.

> but not the back branches?

Yep. I doubt that pg_stat_statements users rely on statistics about DEALLOCATE, so back patching it would be quite safe as well, but I would not advocate it.

-- 
Fabien.
Re: [HACKERS] pg_stat_statements cluttered with "DEALLOCATE dbdpg_p*"
Andres Freund writes:
> On 2014-07-20 17:01:50 +0200, Fabien COELHO wrote:
>> If you do not like my normalization hack (I do not like it much
>> either :-), I have suggested to add "&& !IsA(parsetree,
>> DeallocateStmt)" to the condition above, which would ignore DEALLOCATE
>> as PREPARE and EXECUTE are currently and rightfully ignored.

> Well, EXECUTE isn't actually ignored, but tracked via the execution
> time. But that doesn't diminish your point with PREPARE. If we do
> something we should go for the && !IsA(parsetree, DeallocateStmt), not
> the normalization. The latter is pretty darn bogus.

Agreed. I think basically the reasoning here is "since we don't track PREPARE or EXECUTE, we shouldn't track DEALLOCATE either".

However, this is certainly a behavioral change. Perhaps squeeze it into 9.4, but not the back branches?

			regards, tom lane
Re: [HACKERS] SQL MERGE is quite distinct from UPSERT
On Sat, Jul 19, 2014 at 09:55:19PM -0700, Peter Geoghegan wrote:
> At a high level SQL MERGE is quite distinct from UPSERT, in that it is
> a utility command that performs inserts, updates and deletes while
> avoiding race conditions (e.g. unique constraint violations) on a more
> or less best effort basis. MERGE is conceptually messy. In contrast
> UPSERT is actually atomic, and having its behavior be relatively easy
> to reason about ought to be the top priority. There is a *really* big
> demand for UPSERT from users, not MERGE, although MERGE is certainly
> useful too.

FWIW, I agree. MERGE is hard enough as it is, but trying to guarantee some kind of atomicity makes it nigh on impossible. Indeed, after reading what you wrote I think it may well be impossible to make it atomic *and* make it perform in the general case.

So, +1 UPSERT.

Have a nice day,
-- 
Martijn van Oosterhout   http://svana.org/kleptog/
> He who writes carelessly confesses thereby at the very outset that he does
> not attach much importance to his own thoughts.
  -- Arthur Schopenhauer
Re: [HACKERS] bad estimation together with large work_mem generates terrible slow hash joins
Attached is v9 of the patch. Aside from a few minor fixes, the main change is that it is assumed to be combined with the "dense allocation" patch. It also rewrites ExecHashIncreaseNumBuckets to follow the same pattern as ExecHashIncreaseNumBatches (i.e. scanning chunks directly, instead of buckets). It's cleaner and more consistent.

hashjoin-nbuckets-growth-v9.patch contains only this patch (so you need to grab hashjoin-alloc-v4.patch from a different thread and apply it first).

hashjoin-nbuckets-growth-v9-combined.patch contains both patches combined.

regards
Tomas Vondra

diff --git a/src/backend/commands/explain.c b/src/backend/commands/explain.c
index 781a736..ee3fffb 100644
--- a/src/backend/commands/explain.c
+++ b/src/backend/commands/explain.c
@@ -1900,18 +1900,20 @@ show_hash_info(HashState *hashstate, ExplainState *es)
 	if (es->format != EXPLAIN_FORMAT_TEXT)
 	{
 		ExplainPropertyLong("Hash Buckets", hashtable->nbuckets, es);
+		ExplainPropertyLong("Original Hash Buckets",
+							hashtable->nbuckets_original, es);
 		ExplainPropertyLong("Hash Batches", hashtable->nbatch, es);
 		ExplainPropertyLong("Original Hash Batches",
 							hashtable->nbatch_original, es);
 		ExplainPropertyLong("Peak Memory Usage", spacePeakKb, es);
 	}
-	else if (hashtable->nbatch_original != hashtable->nbatch)
+	else if ((hashtable->nbatch_original != hashtable->nbatch) || (hashtable->nbuckets_original != hashtable->nbuckets))
 	{
 		appendStringInfoSpaces(es->str, es->indent * 2);
 		appendStringInfo(es->str,
-						 "Buckets: %d Batches: %d (originally %d) Memory Usage: %ldkB\n",
-						 hashtable->nbuckets, hashtable->nbatch,
-						 hashtable->nbatch_original, spacePeakKb);
+						 "Buckets: %d (originally %d) Batches: %d (originally %d) Memory Usage: %ldkB\n",
+						 hashtable->nbuckets, hashtable->nbuckets_original,
+						 hashtable->nbatch, hashtable->nbatch_original, spacePeakKb);
 	}
 	else
 	{
diff --git a/src/backend/executor/nodeHash.c b/src/backend/executor/nodeHash.c
index 589b2f1..014b6f1 100644
--- a/src/backend/executor/nodeHash.c
+++ b/src/backend/executor/nodeHash.c
@@ -37,8 +37,10 @@
 #include "utils/lsyscache.h"
 #include "utils/syscache.h"
+bool enable_hash_resize = true;
 static void ExecHashIncreaseNumBatches(HashJoinTable hashtable);
+static void ExecHashIncreaseNumBuckets(HashJoinTable hashtable);
 static void ExecHashBuildSkewHash(HashJoinTable hashtable, Hash *node,
 					  int mcvsToUse);
 static void ExecHashSkewTableInsert(HashJoinTable hashtable,
@@ -47,6 +49,10 @@ static void ExecHashSkewTableInsert(HashJoinTable hashtable,
 					int bucketNumber);
 static void ExecHashRemoveNextSkewBucket(HashJoinTable hashtable);
+static char * chunk_alloc(HashJoinTable hashtable, int tupleSize);
+
+/* Memory needed for optimal number of buckets. */
+#define BUCKETS_SPACE(htab)	((htab)->nbuckets_optimal * sizeof(void*))
 /*
  * ExecHash
@@ -116,6 +122,7 @@ MultiExecHash(HashState *node)
 			/* It's a skew tuple, so put it into that hash table */
 			ExecHashSkewTableInsert(hashtable, slot, hashvalue,
 									bucketNumber);
+			hashtable->skewTuples += 1;
 		}
 		else
 		{
@@ -126,10 +133,34 @@ MultiExecHash(HashState *node)
 		}
 	}
+	/* If average number of tuples per bucket is over the defined threshold,
+	 * increase the number of buckets to get below it. */
+	if (enable_hash_resize) {
+
+		/*
+		 * Do we actually need to resize? Only scale up, don't scale down (for example
+		 * after adding a batch, it's possible that nbuckets > nbuckets_optimal).
+		 */
+		if (hashtable->nbuckets < hashtable->nbuckets_optimal) {
+
+#ifdef HJDEBUG
+			printf("Increasing nbuckets %d => %d\n",
+				   hashtable->nbuckets, hashtable->nbuckets_optimal);
+#endif
+			ExecHashIncreaseNumBuckets(hashtable);
+		}
+
+	}
+
+	/* Account for the buckets in spaceUsed (reported in EXPLAIN ANALYZE) */
+	hashtable->spaceUsed += BUCKETS_SPACE(hashtable);
+	if (hashtable->spaceUsed > hashtable->spacePeak)
+		hashtable->spacePeak = hashtable->spaceUsed;
+
 	/* must provide our own instrumentation support */
 	if (node->ps.instrument)
 		InstrStopNode(node->ps.instrument, hashtable->totalTuples);
-
+
 	/*
 	 * We do not return the hash table directly because it's not a subtype of
 	 * Node, and so would violate the MultiExecProcNode API. Instead, our
@@ -223,7 +254,6 @@ ExecEndHash(HashState *node)
 	ExecEndNode(outerPlan);
 }
-
 /*
  * ExecHashTableCreate
  *
@@ -239,6 +269,7 @@ ExecHashTableCreate(Hash *node, List *hashOperators, bool keepNulls)
 	int			nbatch;
 	int			num_skew_mcvs;
 	int			log2_nbuckets;
+	int			log2_nbatch;
 	int			nkeys;
 	int			i;
 	ListCell   *ho;
@@ -263,6 +294,10 @@ ExecHashTableCreate(Hash *node, List *hashOperators, bool keepNulls)
 	log2_nbuckets = my_log2(nbuckets);
 	Assert(nbuckets == (1 << log2_nbuckets));
+	/* nb
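[Editorial note: with the explain.c hunk above applied, the text-format output of a hash join gains the original bucket count. A rough illustration, with invented numbers and the big_table/small_table names used later in this digest:]

```sql
-- After the change, EXPLAIN ANALYZE of a hash join would print a line
-- of roughly this shape (numbers invented):
--   Buckets: 16384 (originally 1024)  Batches: 1 (originally 1)  Memory Usage: 3153kB
EXPLAIN ANALYZE
SELECT * FROM big_table JOIN small_table USING (id);
```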
Re: [HACKERS] tweaking NTUP_PER_BUCKET
On 19.7.2014 20:24, Tomas Vondra wrote:
> On 13.7.2014 21:32, Tomas Vondra wrote:
>> The current patch only implements this for tuples in the main
>> hash table, not for skew buckets. I plan to do that, but it will
>> require separate chunks for each skew bucket (so we can remove it
>> without messing with all of them). The chunks for skew buckets
>> should be smaller (I'm thinking about ~4kB), because there'll be
>> more of them, but OTOH those buckets are for frequent values so the
>> chunks should not remain empty.
>
> I've looked into extending the dense allocation to the skew buckets,
> and I think we shouldn't do that. I got about 50% of the changes and
> then just threw it out because it turned out quite pointless.
>
> The amount of memory for skew buckets is limited to 2% of work mem,
> so even with 100% overhead it'll use ~4% of work mem. So there's
> pretty much nothing to gain here. So the additional complexity
> introduced by the dense allocation seems pretty pointless.
>
> I'm not entirely happy with the code allocating some memory densely
> and some using traditional palloc, but it certainly seems cleaner
> than the code I had.
>
> So I think the patch is mostly OK as is.

Attached is v4 of the patch, mostly with minor improvements and cleanups (removed MemoryContextStats, code related to skew buckets).
regards
Tomas

diff --git a/src/backend/executor/nodeHash.c b/src/backend/executor/nodeHash.c
index 589b2f1..6a9d956 100644
--- a/src/backend/executor/nodeHash.c
+++ b/src/backend/executor/nodeHash.c
@@ -47,6 +47,7 @@ static void ExecHashSkewTableInsert(HashJoinTable hashtable,
 					int bucketNumber);
 static void ExecHashRemoveNextSkewBucket(HashJoinTable hashtable);
+static char * chunk_alloc(HashJoinTable hashtable, int tupleSize);
 /*
  * ExecHash
@@ -129,7 +130,7 @@ MultiExecHash(HashState *node)
 	/* must provide our own instrumentation support */
 	if (node->ps.instrument)
 		InstrStopNode(node->ps.instrument, hashtable->totalTuples);
-
+
 	/*
 	 * We do not return the hash table directly because it's not a subtype of
 	 * Node, and so would violate the MultiExecProcNode API. Instead, our
@@ -223,7 +224,6 @@ ExecEndHash(HashState *node)
 	ExecEndNode(outerPlan);
 }
-
 /*
  * ExecHashTableCreate
  *
@@ -294,6 +294,8 @@ ExecHashTableCreate(Hash *node, List *hashOperators, bool keepNulls)
 	hashtable->spaceAllowedSkew =
 		hashtable->spaceAllowed * SKEW_WORK_MEM_PERCENT / 100;
+	hashtable->chunks_batch = NULL;
+
 	/*
 	 * Get info about the hash functions to be used for each hash key. Also
 	 * remember whether the join operators are strict.
@@ -556,10 +558,10 @@ ExecHashIncreaseNumBatches(HashJoinTable hashtable)
 	int			oldnbatch = hashtable->nbatch;
 	int			curbatch = hashtable->curbatch;
 	int			nbatch;
-	int			i;
 	MemoryContext oldcxt;
 	long		ninmemory;
 	long		nfreed;
+	HashChunk	oldchunks = hashtable->chunks_batch;
 	/* do nothing if we've decided to shut off growth */
 	if (!hashtable->growEnabled)
@@ -612,51 +614,70 @@ ExecHashIncreaseNumBatches(HashJoinTable hashtable)
 	 */
 	ninmemory = nfreed = 0;
-	for (i = 0; i < hashtable->nbuckets; i++)
-	{
-		HashJoinTuple prevtuple;
-		HashJoinTuple tuple;
+	/* we're going to scan through the chunks directly, not buckets, so we can reset the
+	 * buckets and reinsert all the tuples there - just set them to NULL */
+	memset(hashtable->buckets, 0, sizeof(void*) * hashtable->nbuckets);
-		prevtuple = NULL;
-		tuple = hashtable->buckets[i];
+	/* reset the chunks (we have a copy in oldchunks) - this way we'll allocate */
+	hashtable->chunks_batch = NULL;
-		while (tuple != NULL)
-		{
-			/* save link in case we delete */
-			HashJoinTuple nexttuple = tuple->next;
-			int			bucketno;
-			int			batchno;
+	/* so, let's scan through the old chunks, and all tuples in each chunk */
+	while (oldchunks != NULL) {
+
+		/* pointer to the next memory chunk (may be NULL for the last chunk) */
+		HashChunk nextchunk = oldchunks->next;
+
+		/* position within the buffer (up to oldchunks->used) */
+		size_t idx = 0;
+
+		/* process all tuples stored in this chunk (and then free it) */
+		while (idx < oldchunks->used) {
+
+			/* get the hashjoin tuple and mintuple stored right after it */
+			HashJoinTuple hashTuple = (HashJoinTuple)(oldchunks->data + idx);
+			MinimalTuple tuple = HJTUPLE_MINTUPLE(hashTuple);
+
+			int hashTupleSize = (HJTUPLE_OVERHEAD + tuple->t_len);
+
+			int bucketno;
+			int batchno;
 			ninmemory++;
-			ExecHashGetBucketAndBatch(hashtable, tuple->hashvalue,
+			ExecHashGetBucketAndBatch(hashtable, hashTuple->hashvalue,
 									  &bucketno, &batchno);
-			Assert(bucketno == i);
+
 			if (batchno == curbatch) {
-				/* keep tuple */
-				prevtuple = tuple;
+				/* keep tuple - this means we need to copy it into the new chunks */
+				HashJoinTuple copyTuple = (H
Re: [HACKERS] pg_stat_statements cluttered with "DEALLOCATE dbdpg_p*"
> [...]. If we do something we should go for the && !IsA(parsetree,
> DeallocateStmt), not the normalization.

Ok.

> The latter is pretty darn bogus.

Yep :-)

I'm fine with ignoring DEALLOCATE altogether.

-- 
Fabien.
Re: [HACKERS] pg_stat_statements cluttered with "DEALLOCATE dbdpg_p*"
On 2014-07-20 17:01:50 +0200, Fabien COELHO wrote:
>> That's because PREPARE isn't executed as its own statement, but done on
>> the protocol level (which will need noticeably fewer messages). There's
>> no builtin logic to ignore actual PREPARE statements.
>
> ISTM that there is indeed special handling in function
> pgss_ProcessUtility for PREPARE and EXECUTE:
>
> 	/*
> 	 * If it's an EXECUTE statement, we don't track it and don't increment the
> 	 * nesting level. This allows the cycles to be charged to the underlying
> 	 * PREPARE instead (by the Executor hooks), which is much more useful.
> 	 *
> 	 * We also don't track execution of PREPARE. If we did, we would get one
> 	 * hash table entry for the PREPARE (with hash calculated from the query
> 	 * string), and then a different one with the same query string (but hash
> 	 * calculated from the query tree) would be used to accumulate costs of
> 	 * ensuing EXECUTEs. This would be confusing, and inconsistent with other
> 	 * cases where planning time is not included at all.
> 	 */
> 	if (pgss_track_utility && pgss_enabled() &&
> 		!IsA(parsetree, ExecuteStmt) &&
> 		!IsA(parsetree, PrepareStmt))
> 		... standard handling ...
> 	else
> 		... special "no" handling ...
>
>> So I don't think your consistency argument counts as much here.
>
> I think that given the above code, my argument stands reasonably.

Ick. You have a point.

> If you do not like my normalization hack (I do not like it much
> either :-), I have suggested to add "&& !IsA(parsetree,
> DeallocateStmt)" to the condition above, which would ignore DEALLOCATE
> as PREPARE and EXECUTE are currently and rightfully ignored.

Well, EXECUTE isn't actually ignored, but tracked via the execution time. But that doesn't diminish your point with PREPARE. If we do something we should go for the && !IsA(parsetree, DeallocateStmt), not the normalization. The latter is pretty darn bogus.
Greetings,

Andres Freund

-- 
Andres Freund                  http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services
Re: [HACKERS] pg_stat_statements cluttered with "DEALLOCATE dbdpg_p*"
> That's because PREPARE isn't executed as its own statement, but done on
> the protocol level (which will need noticeably fewer messages). There's
> no builtin logic to ignore actual PREPARE statements.

ISTM that there is indeed special handling in function pgss_ProcessUtility for PREPARE and EXECUTE: [...]

For completeness, here is the one-liner patch to handle DEALLOCATE as PREPARE & EXECUTE are handled. It is cleaner than the other one, but then DEALLOCATE disappears from the table, as PREPARE and EXECUTE do.

-- 
Fabien.

diff --git a/contrib/pg_stat_statements/pg_stat_statements.c b/contrib/pg_stat_statements/pg_stat_statements.c
index 07f09e1..a8000cb 100644
--- a/contrib/pg_stat_statements/pg_stat_statements.c
+++ b/contrib/pg_stat_statements/pg_stat_statements.c
@@ -955,10 +955,14 @@
 	 * calculated from the query tree) would be used to accumulate costs of
 	 * ensuing EXECUTEs. This would be confusing, and inconsistent with other
 	 * cases where planning time is not included at all.
+	 *
+	 * DEALLOCATE of a prepared statement is skipped, especially as some drivers
+	 * use process/counter-based names which end up cluttering the stats table.
 	 */
 	if (pgss_track_utility && pgss_enabled() &&
 		!IsA(parsetree, ExecuteStmt) &&
-		!IsA(parsetree, PrepareStmt))
+		!IsA(parsetree, PrepareStmt) &&
+		!IsA(parsetree, DeallocateStmt))
 	{
 		instr_time	start;
 		instr_time	duration;
Re: [HACKERS] pg_stat_statements cluttered with "DEALLOCATE dbdpg_p*"
> That's because PREPARE isn't executed as its own statement, but done on
> the protocol level (which will need noticeably fewer messages). There's
> no builtin logic to ignore actual PREPARE statements.

ISTM that there is indeed special handling in function pgss_ProcessUtility for PREPARE and EXECUTE:

	/*
	 * If it's an EXECUTE statement, we don't track it and don't increment the
	 * nesting level. This allows the cycles to be charged to the underlying
	 * PREPARE instead (by the Executor hooks), which is much more useful.
	 *
	 * We also don't track execution of PREPARE. If we did, we would get one
	 * hash table entry for the PREPARE (with hash calculated from the query
	 * string), and then a different one with the same query string (but hash
	 * calculated from the query tree) would be used to accumulate costs of
	 * ensuing EXECUTEs. This would be confusing, and inconsistent with other
	 * cases where planning time is not included at all.
	 */
	if (pgss_track_utility && pgss_enabled() &&
		!IsA(parsetree, ExecuteStmt) &&
		!IsA(parsetree, PrepareStmt))
		... standard handling ...
	else
		... special "no" handling ...

> So I don't think your consistency argument counts as much here.

I think that given the above code, my argument stands reasonably.

If you do not like my normalization hack (I do not like it much either :-), I have suggested to add "&& !IsA(parsetree, DeallocateStmt)" to the condition above, which would ignore DEALLOCATE as PREPARE and EXECUTE are currently and rightfully ignored.

-- 
Fabien.
Re: [HACKERS] tweaking NTUP_PER_BUCKET
On 20.7.2014 00:12, Tomas Vondra wrote:
> On 19.7.2014 23:07, Tomas Vondra wrote:
>> On 19.7.2014 20:28, Tomas Vondra wrote:
>>
>> For the first case, a WARNING at the end of estimate_hash_bucketsize
>> says this:
>>
>>   WARNING: nbuckets=8388608.00 estfract=0.01
>>   WARNING: nbuckets=65536.00 estfract=0.000267
>>
>> There are 4.3M rows in the big_table, so it says ~4 tuples (selectivity
>> = 1e-6), and ~10 tuples/bucket for the small_table (42k rows).
>>
>> For the second case, I get this:
>>
>>   WARNING: nbuckets=16777216.00 estfract=0.01
>>   WARNING: nbuckets=262144.00 estfract=0.000100
>>
>> i.e. ~11 tuples/bucket for big_table and ~20 tuples for small_table.
>>
>> That sounds really strange, because small_table in the second test case
>> is almost perfectly unique. And in the first test case, there are only
>> very few keys with >2 occurrences.
>
> FWIW, the "problem" seems to be this:
>
>     /*
>      * Adjust estimated bucketsize upward to account for skewed
>      * distribution.
>      */
>     if (avgfreq > 0.0 && mcvfreq > avgfreq)
>         estfract *= mcvfreq / avgfreq;
>
> Which is explained in the estimate_hash_bucketsize comment like this:
>
>  * be nonuniformly occupied. If the other relation in the join has a key
>  * distribution similar to this one's, then the most-loaded buckets are
>  * exactly those that will be probed most often. Therefore, the "average"
>  * bucket size for costing purposes should really be taken as something
>  * close to the "worst case" bucket size. We try to estimate this by
>  * adjusting the fraction if there are too few distinct data values, and
>  * then scaling up by the ratio of the most common value's frequency to
>  * the average frequency.

After thinking about this a bit more, I'm starting to question the logic behind the adjustment. The thing is, with NTUP_PER_BUCKET=1, all the tuples in a bucket should have the same hash value and (hash collisions aside) the same join key value.

That means we're not really searching through the bucket, we're merely iterating and returning all the tuples. With NTUP_PER_BUCKET=10 it maybe made sense, because it was expected to see multiple hash values in a single bucket. With NTUP_PER_BUCKET=1 we shouldn't really expect that.

So I think we should rip the MCV adjustment out of estimate_hash_bucketsize (and indeed, that fixed both the test cases). The only case where I think this might play a role is when we can't use the optimal number of buckets (achieving NTUP_PER_BUCKET=1), but in that case the patched ExecChooseHashTableSize prefers to increase the number of batches. So that seems safe to me. Or in what scenario (with NTUP_PER_BUCKET=1) does this still make sense?

Two more comments:

(a) Isn't the minimum selectivity enforced in estimate_hash_bucketsize (1.0e-6) too high? I mean, with 1GB work_mem I can get ~20M tuples into the hash table, and in that case it's off by an order of magnitude. Maybe 1e-9 would be more appropriate?

(b) Currently, ExecChooseHashTableSize sizes the hash table (nbuckets) using ntuples, but maybe ndistinct (if available) would be better? It's tricky, though, because while both values are initially estimates, ndistinct is notoriously less reliable. Also, we currently don't have means to count ndistinct while building the table (as opposed to ntuples, which is trivial), which is needed when resizing the hash table in case the initial estimate was off. So currently the code optimizes for "ndistinct==ntuples", which may produce larger nbuckets values, but ISTM it's the right choice.

regards
Tomas
Re: [HACKERS] pg_stat_statements cluttered with "DEALLOCATE dbdpg_p*"
On 2014-07-20 14:43:27 +0200, Fabien COELHO wrote:
>> a) Consider using the extended query protocol.
>> b) consider using unnamed prepared statements to reduce the number of
>>    roundtrips
>> c) wonder why PREPARE/DEALLOCATE are so much more frequent than the
>>    actual query execution.
>
> (1) I'm not responsible for DBD::Pg allocating "random" names to prepared
> statements, even if the queries are the same, and that accumulate over time
> (weeks, possibly months).
>
> (2) pg_stat_statements is currently inconsistent anyway, as PREPARE is not
> counted (but the underlying query is on each EXECUTE), although its
> corresponding DEALLOCATE is counted, so I think that something is needed for
> consistency.

That's because PREPARE isn't executed as its own statement, but done on the protocol level (which will need noticeably fewer messages). There's no builtin logic to ignore actual PREPARE statements. So I don't think your consistency argument counts as much here.

Greetings,

Andres Freund

-- 
Andres Freund                  http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services
Re: [HACKERS] pg_stat_statements cluttered with "DEALLOCATE dbdpg_p*"
Hello Andres,

> Why isn't the driver using the extended query protocol? Sending
> PREPARE/EXECUTE/DEALLOCATE wastes roundtrips...

>> It seems to me that it would be more helpful if these similar entries
>> were aggregated together, that is if the query "normalization" could
>> ignore the name of the descriptor.
>
> I'm not in favor of this. If there's DEALLOCATEs that are frequent
> enough to drown out other entries you should

Thanks for the advice. I'm not responsible for the application nor the driver, and I think that pg_stat_statements should be consistent and reasonable independently of drivers and applications.

> a) Consider using the extended query protocol.
> b) consider using unnamed prepared statements to reduce the number of
>    roundtrips
> c) wonder why PREPARE/DEALLOCATE are so much more frequent than the
>    actual query execution.

(1) I'm not responsible for DBD::Pg allocating "random" names to prepared statements, even if the queries are the same, and that accumulate over time (weeks, possibly months).

(2) pg_stat_statements is currently inconsistent anyway, as PREPARE is not counted (but the underlying query is on each EXECUTE), although its corresponding DEALLOCATE is counted, so I think that something is needed for consistency.

-- 
Fabien.
Re: [HACKERS] pg_stat_statements cluttered with "DEALLOCATE dbdpg_p*"
On 2014-07-20 14:06, Andres Freund wrote:
> On 2014-07-20 13:54:01 +0200, Andres Freund wrote:
>> On 2014-04-01 16:45:29 +0200, Fabien COELHO wrote:
>>> I noticed that my pg_stat_statements is cluttered with hundreds of
>>> entries like "DEALLOCATE dbdpg_p123456_7", occurring each only once.
>>
>> Why isn't the driver using the extended query protocol? Sending
>> PREPARE/EXECUTE/DEALLOCATE wastes roundtrips...
>
> Hm. It's probably because libpq doesn't expose sending a Close message
> for prepared statements :(. No idea why.

Yeah, I always considered that a missing feature, and even wrote a patch to add it at some point. I wonder what happened to it.

.marko
Re: [HACKERS] pg_stat_statements cluttered with "DEALLOCATE dbdpg_p*"
On 2014-07-20 13:54:01 +0200, Andres Freund wrote:
> Hi,
>
> On 2014-04-01 16:45:29 +0200, Fabien COELHO wrote:
>> I noticed that my pg_stat_statements is cluttered with hundreds of
>> entries like "DEALLOCATE dbdpg_p123456_7", each occurring only once.
>
> Why isn't the driver using the extended query protocol? Sending
> PREPARE/EXECUTE/DEALLOCATE wastes roundtrips...

Hm. It's probably because libpq doesn't expose sending a Close message for prepared statements :(. No idea why.

Greetings,

Andres Freund

-- 
Andres Freund   http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
Re: [HACKERS] pg_stat_statements cluttered with "DEALLOCATE dbdpg_p*"
Hi,

On 2014-04-01 16:45:29 +0200, Fabien COELHO wrote:
> I noticed that my pg_stat_statements is cluttered with hundreds of
> entries like "DEALLOCATE dbdpg_p123456_7", each occurring only once.

Why isn't the driver using the extended query protocol? Sending PREPARE/EXECUTE/DEALLOCATE wastes roundtrips...

> It seems to me that it would be more helpful if these similar entries
> were aggregated together, that is if the query "normalization" could
> ignore the name of the descriptor.

I'm not in favor of this. If there's DEALLOCATEs that are frequent enough to drown out other entries you should

a) consider using the extended query protocol;
b) consider using unnamed prepared statements to reduce the number of roundtrips;
c) wonder why PREPARE/DEALLOCATE are so much more frequent than the actual query execution.

Greetings,

Andres Freund

-- 
Andres Freund   http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
Re: [HACKERS] pg_stat_statements cluttered with "DEALLOCATE dbdpg_p*"
Hello devs,

I noticed that my pg_stat_statements is cluttered with hundreds of entries like "DEALLOCATE dbdpg_p123456_7", each occurring only once.

Here is a patch and an SQL test file to:

* normalize DEALLOCATE utility statements in pg_stat_statements

Some drivers such as DBD::Pg generate process/counter-based identifiers for PREPARE, which results in hundreds of DEALLOCATEs being tracked, although the prepared query may be the same. This is also consistent with the corresponding PREPARE not being tracked (although the underlying prepared query *is* tracked).

** Note **: another, simpler option would be to skip deallocates altogether by inserting "&& !IsA(parsetree, DeallocateStmt)" at the beginning of pgss_ProcessUtility(). I'm not sure what is best.

-- 
Fabien.

diff --git a/contrib/pg_stat_statements/pg_stat_statements.c b/contrib/pg_stat_statements/pg_stat_statements.c
index 07f09e1..4b3348a 100644
--- a/contrib/pg_stat_statements/pg_stat_statements.c
+++ b/contrib/pg_stat_statements/pg_stat_statements.c
@@ -966,10 +966,19 @@ pgss_ProcessUtility(Node *parsetree, const char *queryString,
 		BufferUsage bufusage_start,
 					bufusage;
 		uint32		queryId;
+		const char *normalQueryString;
 
 		bufusage_start = pgBufferUsage;
 		INSTR_TIME_SET_CURRENT(start);
 
+		/*
+		 * normalize queryString for DEALLOCATE, as drivers such as DBD::Pg
+		 * generate counter-based names for prepared statements which can
+		 * result in many useless entries.
+		 */
+		normalQueryString = strncasecmp(queryString, "DEALLOCATE", 10) == 0 ?
+			"DEALLOCATE ?;" : queryString;
+
 		nested_level++;
 		PG_TRY();
 		{
@@ -1025,9 +1034,9 @@ pgss_ProcessUtility(Node *parsetree, const char *queryString,
 		INSTR_TIME_SUBTRACT(bufusage.blk_write_time, bufusage_start.blk_write_time);
 
 		/* For utility statements, we just hash the query string directly */
-		queryId = pgss_hash_string(queryString);
+		queryId = pgss_hash_string(normalQueryString);
 
-		pgss_store(queryString,
+		pgss_store(normalQueryString,
 				   queryId,
 				   INSTR_TIME_GET_MILLISEC(duration),
 				   rows,

[Attachment: pgss-norm-deallocate.sql (application/sql)]
Re: [HACKERS] Built-in binning functions
On 16 July 2014 20:35, Pavel Stehule wrote:
> 2014-07-16 10:04 GMT+02:00 Petr Jelinek:
>> On 08/07/14 02:14, Tom Lane wrote:
>>> Petr Jelinek writes:
>>>> here is a patch implementing varwidth_bucket (naming is up for
>>>> discussion), a function which does binning with variable bucket width.
>>>
>>> I didn't see any discussion of the naming question in this thread.
>>> I'd like to propose that it should be just "width_bucket()"; we can
>>> easily determine which function is meant, considering that the
>>> SQL-spec variants don't take arrays and don't even have the same
>>> number of actual arguments.
>>
>> I did mention in the submission that the names are up for discussion;
>> I am all for naming it just width_bucket.
>
> I had this idea too - but I am not sure it is a good idea. The distance
> between the ANSI SQL width_bucket and our enhancement is larger than in
> our implementation of "median", for example.
>
> I can live with both names, but I prefer the current one.

Hmmm, not sure. Let's look around and think about what words people use.

Transforms of this type are referred to as discretization in the formal literature and as binning in common usage and statistical software.

width_bucket() seems to refer to an equal-width binning process. The function being discussed here is a generic mechanism, the boundaries of which could have been decided using equal-frequency or other mechanisms. Using the word "width" in those contexts could be confusing.

Given that width_bucket() is already in use for the SQL standard function, I agree it would be confusing to have the same name for different parameter profiles.

So I suggest that we use the more generic function name bin(), with a second choice of discretize() (though that seems annoyingly easy to spell incorrectly).

Whatever we do, it seems clear we need a section in the manual to describe Statistical Functions, covering width_bucket() and whatever we call this function, and including the terms bin, binning, transform, discretize and discretization to ensure those are easily searchable.

-- 
Simon Riggs   http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
[HACKERS] Production block comparison facility
The block comparison facility presented earlier by Heikki could not be used in production systems. ISTM that it would be desirable to have something that could be used that way.

ISTM easy to make these changes:

* optionally generate an FPW for every WAL record, not just the first change after a checkpoint:

    full_page_writes = 'always'

* when an FPW arrives, optionally run a check to see if it compares correctly against the page already there, when running streaming replication without a recovery target (we could skip reporting any problems until the database is consistent):

    full_page_write_check = on

The above changes seem easy to implement. With FPW compression, this would be a usable feature in production.

Comments?

-- 
Simon Riggs   http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services