Re: [HACKERS] jsonb format is pessimal for toast compression
On Fri, Aug 15, 2014 at 8:19 PM, Tom Lane t...@sss.pgh.pa.us wrote:
> Arthur Silva arthur...@gmail.com writes:
>> We should add some sort of versioning to the jsonb format. This can be
>> explored in the future in many ways.
>
> If we end up making an incompatible change to the jsonb format, I would
> support taking the opportunity to stick a version ID in there. But I
> don't want to force a dump/reload cycle *only* to do that.
>
>> As for the current problem, we should explore the directory-at-the-end
>> option. It should improve compression and keep good access performance.
>
> Meh. Pushing the directory to the end is just a band-aid, and since it
> would still force a dump/reload, it's not a very enticing band-aid. The
> only thing it'd really fix is the first_success_by issue, which we could
> fix *without* a dump/reload by using different compression parameters
> for jsonb. Moving the directory to the end, by itself, does nothing to
> fix the problem that the directory contents aren't compressible --- and
> we now have pretty clear evidence that that is a significant issue.
> (See for instance Josh's results that increasing first_success_by did
> very little for the size of his dataset.)
>
> I think the realistic alternatives at this point are either to switch
> to all-lengths as in my test patch, or to use the hybrid approach of
> Heikki's test patch. IMO the major attraction of Heikki's patch is that
> it'd be upward compatible with existing beta installations, ie no
> initdb required (but thus, no opportunity to squeeze in a version
> identifier either). It's not showing up terribly well in the
> performance tests I've been doing --- it's about halfway between HEAD
> and my patch on that extract-a-key-from-a-PLAIN-stored-column test.
> But, just as with my patch, there are things that could be done to
> micro-optimize it by touching a bit more code.
> I did some quick stats comparing compressed sizes for the delicio.us
> data, printing quartiles as per Josh's lead:
>
>   all-lengths     {440, 569, 609, 655, 1257}
>   Heikki's patch  {456, 582, 624, 671, 1274}
>   HEAD            {493, 636, 684, 744, 1485}
>
> (As before, this is pg_column_size of the jsonb within a table whose
> rows are wide enough to force tuptoaster.c to try to compress the
> jsonb; otherwise many of these values wouldn't get compressed.) These
> documents don't have enough keys to trigger the first_success_by issue,
> so HEAD doesn't look too awful, but there's still about an 11% gain
> from switching from offsets to lengths. Heikki's method captures much
> of that, but not all.
>
> Personally I'd prefer to go to the all-lengths approach, but a large
> part of that comes from a subjective assessment that the hybrid
> approach is too messy. Others might well disagree.
>
> In case anyone else wants to do measurements on some more data sets,
> attached is a copy of Heikki's patch updated to apply against git tip.
>
> regards, tom lane

I agree that versioning might sound silly at this point, but let's keep
it in mind. Row-level compression is quite slow by itself, so it seems
odd to me to pay a 25% performance penalty everywhere for the sake of a
better compression ratio in the dictionary area.

Consider, for example, an optimization that stuffs integers (up to 28
bits) inside the JEntry itself. That alone would save 8 bytes for each
integer.
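The compressibility gap between offsets and lengths is easy to reproduce
outside PostgreSQL: an offset array is a strictly increasing sequence in
which every word is distinct, while a length array repeats a small set of
values that deflate well. A standalone Python sketch of the effect (this
simulates a JEntry-like array of 32-bit words; it is an illustration, not
the actual jsonb on-disk format):

```python
import struct
import zlib


def pack_u32(values):
    """Pack ints as little-endian 32-bit words, like an array of JEntry-sized slots."""
    return b"".join(struct.pack("<I", v) for v in values)


# Simulated sizes for 1000 entries: a small repeating set of lengths.
lengths = [(4, 8, 12, 16)[i % 4] for i in range(1000)]

# Offsets are running totals, so every word is distinct and monotonically rising.
offsets = []
total = 0
for n in lengths:
    total += n
    offsets.append(total)

comp_lengths = zlib.compress(pack_u32(lengths))
comp_offsets = zlib.compress(pack_u32(offsets))

# The repeating lengths compress far better than the all-distinct offsets.
print(len(comp_lengths), len(comp_offsets))
```

With data like this, the compressed lengths array is a small fraction of
the compressed offsets array, which is the same structural effect behind
Tom's ~11% whole-datum numbers.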
Re: [HACKERS] pg_receivexlog --status-interval add fsync feedback
On Wed, Aug 13, 2014 at 5:55 PM, furu...@pm.nttdata.co.jp wrote:
>> I don't think that it's a good idea to control that behavior by using
>> --status-interval. I'm sure there are users who want both that
>> behavior and the ability to set the maximum interval between feedback
>> messages sent back to the server, because these settings are available
>> in walreceiver. But your version of --status-interval doesn't allow
>> those settings at all. That is, if it's set to -1, the maximum
>> interval between feedback messages cannot be set. OTOH, if it's set to
>> a positive number, the feedback-as-soon-as-fsync behavior cannot be
>> enabled. Therefore, I'm thinking that it's better to introduce a new,
>> separate option for that behavior.
>
> Thanks for the review! This patch splits the option as you pointed out.
> If the -r option is specified, status packets are sent to the server as
> soon as possible after each fsync. Otherwise, status packets are sent
> to the server at the spacing of --status-interval.

Hi,

I took a look at this patch. I applied it to master successfully, and got
no errors while compiling. It also works fine.

One question: why is reply_fsync defined as a volatile variable? Sorry, I
could not understand the reason for that.

The patch currently modifies the arguments of some functions (e.g.,
HandleCopyStream, ProcessXLogDataMsg) and adds similar code to each of
them. I don't think that is a good approach. For example, I think we
should gather this code into one function. Thoughts?

Regards,

---
Sawada Masahiko

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
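The split behavior under discussion boils down to a small decision rule.
An illustrative Python sketch (the flag and parameter names mirror the
thread, not the actual pg_receivexlog C code):

```python
def should_send_feedback(reply_fsync, just_fsynced, last_sent, status_interval, now):
    """Decide whether to send a status packet to the server.

    reply_fsync mimics the proposed -r flag: report as soon as WAL has
    been fsynced. Independently of that, the periodic --status-interval
    behaviour still applies, so both settings can be in effect at once.
    """
    # -r mode: feedback immediately after an fsync completes.
    if reply_fsync and just_fsynced:
        return True
    # Periodic mode: feedback once status_interval seconds have elapsed.
    if status_interval > 0 and now - last_sent >= status_interval:
        return True
    return False
```

This also shows why a single overloaded --status-interval cannot express
both settings: with one knob you must pick either the fsync trigger or
the timer, whereas two options let each condition fire independently.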
Re: PENDING_LIST_CLEANUP_SIZE - maximum size of GIN pending list Re: [HACKERS] HEAD seems to generate larger WAL regarding GIN index
On Fri, Aug 8, 2014 at 11:45 PM, Fujii Masao masao.fu...@gmail.com wrote:
> On Fri, Aug 8, 2014 at 11:43 PM, Fujii Masao masao.fu...@gmail.com wrote:
>> On Mon, Apr 14, 2014 at 11:40 PM, Tom Lane t...@sss.pgh.pa.us wrote:
>>> Fujii Masao masao.fu...@gmail.com writes:
>>>> On Tue, Apr 1, 2014 at 1:41 AM, Robert Haas robertmh...@gmail.com wrote:
>>>>> Should we try to install some hack around fastupdate for 9.4? I
>>>>> fear the divergence between reasonable values of work_mem and
>>>>> reasonable sizes for that list is only going to continue to get
>>>>> bigger. I'm sure there's somebody out there who has work_mem =
>>>>> 16GB, and stuff like 263865a48973767ce8ed7b7788059a38a24a9f37 is
>>>>> only going to increase the appeal of large values.
>>>>
>>>> Controlling the threshold of the size of the pending list only by a
>>>> GUC doesn't seem reasonable. Users may want to increase the
>>>> threshold only for a GIN index which is updated heavily, and
>>>> decrease it otherwise. So I think that it's better to add a new
>>>> storage parameter for GIN indexes to control the threshold, or both
>>>> a storage parameter and a GUC.
>>>
>>> Yeah, -1 for a GUC. A GIN-index-specific storage parameter seems more
>>> appropriate.
>>
>> The attached patch introduces the GIN index storage parameter
>> PENDING_LIST_CLEANUP_SIZE, which specifies the maximum size of the GIN
>> pending list. If it's not set, work_mem is used as that maximum size
>> instead. So this patch doesn't break existing applications that
>> currently use work_mem as the threshold for cleanup of the pending
>> list; they just need to leave PENDING_LIST_CLEANUP_SIZE unset.
>>
>> Since this is an index storage parameter, it allows us to specify the
>> threshold per GIN index. So, for example, we can increase the
>> threshold only for a GIN index which is updated heavily, and decrease
>> it otherwise.
>>
>> This patch uses another patch that I proposed (*1) as infrastructure.
>> Please apply that infrastructure patch first if you apply this patch.
>> (*1) http://www.postgresql.org/message-id/CAHGQGwEanQ_e8WLHL25=bm_8z5zkyzw0k0yir+kdmv2hgne...@mail.gmail.com
>>
>> Regards,
>>
>> --
>> Fujii Masao
>
> Sorry, I forgot to attach the patch. This time, attached.

I think that this patch should be rebased. It contains garbage code.

Regards,

---
Sawada Masahiko
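The fallback rule the patch describes - a per-index storage parameter
that defaults to work_mem when unset - is a one-liner in spirit. A toy
Python model of the lookup (illustrative only; the names follow the
patch description, not the actual reloptions C code):

```python
def pending_list_threshold(index_reloptions, work_mem_kb):
    """Return the pending-list cleanup threshold (in kB) for one GIN index.

    A per-index PENDING_LIST_CLEANUP_SIZE storage parameter wins when
    set; otherwise fall back to work_mem, preserving the pre-patch
    behaviour for applications that never set the new parameter.
    """
    value = index_reloptions.get("pending_list_cleanup_size")
    return value if value is not None else work_mem_kb
```

This is what lets two GIN indexes in the same session use different
thresholds, which a GUC alone could not express.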
Re: [HACKERS] Proposal to add a QNX 6.5 port to PostgreSQL
Nice algorithm.

On Fri, Aug 15, 2014 at 02:16:08PM -0400, Robert Haas wrote:
On Fri, Aug 15, 2014 at 12:02 PM, Tom Lane t...@sss.pgh.pa.us wrote:

* We talked about combining this technique with a plain file lock so
that we would have belt-and-suspenders protection, in particular
something that would have a chance of working across NFS clients. This
would suggest leaving the fcntl lock in place, ie, don't do step 11, and
also that the file-to-be-locked *not* have any other purpose (which
would only increase the risk of losing the lock through careless
open/close).

I'd be afraid that a secondary mechanism that mostly-but-not-really
works could do more harm by allowing us to miss bugs in the primary,
pipe-based locking mechanism than the good it would accomplish.

Users do corrupt their NFS- and GFS2-hosted databases today. I would
rather have each process hold only an fcntl() lock than hold only the
FIFO file descriptor. There's no such dichotomy, so let's have both.

Meh. Do we really want to allow a new postmaster to start if there are
any processes remaining that were launched by backends? I'd be inclined
to just suppress close-on-exec, period.

Seems like a pretty weird and artificial restriction. Anything that has
done exec() will not be connected to shared memory, so it really doesn't
matter whether it's still alive or not.

People can and do write extensions that launch processes from PostgreSQL
backends via fork()+exec(), and we've taken pains in the past not to
break such cases. I don't see a reason to impose now (for no
data-integrity-related reason) the rule that any such processes must not
be daemons.

+1

--
Noah Misch
EnterpriseDB                                 http://www.enterprisedb.com
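The careless open/close hazard Tom flags is a real property of POSIX
fcntl locks: the lock is dropped when *any* descriptor the process holds
for the file is closed, not just the one the lock was taken on. A small
Python sketch of a belt-and-suspenders fcntl lock (illustrative; not
PostgreSQL's actual data-directory locking code):

```python
import fcntl
import os


def try_lock(path):
    """Try to take a non-blocking exclusive fcntl lock on path.

    Returns the open fd on success, or None if another process already
    holds the lock. The fd must be kept open for the lock's lifetime:
    a POSIX fcntl lock is released as soon as the process closes *any*
    descriptor for the file, which is why the locked file should serve
    no other purpose.
    """
    fd = os.open(path, os.O_RDWR | os.O_CREAT, 0o600)
    try:
        fcntl.lockf(fd, fcntl.LOCK_EX | fcntl.LOCK_NB)
    except OSError:
        os.close(fd)
        return None
    return fd
```

Note that fcntl locks are per-process: a second try_lock() from the same
process succeeds rather than conflicts, which is one of the classic
sharp edges in using them as a postmaster-alive check.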
Re: [HACKERS] pgbench --tuple-size option
>> The custom initialization is to run a manual ALTER after the
>> initialization. Sure, it can be done this way. I'm not sure about the
>> implication of ALTER on the table storage,
>
> Should be fine in this case.

After some testing and laughing, my conclusion is: not fine at all.

The filler attributes in pgbench are by default EXTENDED, which means
possibly compressed... As the default value is '', the compression, when
tried for large sizes, performs very well, and the performance is the
same as with a (declared) smaller tuple :-) Probably not the intention
of the benchmark designer.

Conclusion: I need an ALTER TABLE anyway, to change the STORAGE. Or
maybe pgbench should always do that anyway...

Conclusion 2: I've marked the submission as rejected, since both you and
Fujii don't like it. Although I found the option useful, I can do
without it quite easily.

--
Fabien.
Re: [HACKERS] option -T in pg_basebackup doesn't work on windows
On Fri, Aug 15, 2014 at 1:03 PM, MauMau maumau...@gmail.com wrote:
> Thank you. The code looks correct. I confirmed that pg_basebackup could
> relocate the tablespace directory on Windows. I marked this patch as
> ready for committer.

Thanks for the review.

With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com
Re: [HACKERS] bad estimation together with large work_mem generates terrible slow hash joins
On 12.8.2014 00:30, Tomas Vondra wrote:
> On 11.8.2014 20:25, Robert Haas wrote:
>> It also strikes me that when there's only 1 batch, the set of bits
>> that map onto the batch number is zero-width, and one zero-width bit
>> range is as good as another. In other words, if you're only planning
>> to do one batch, you can easily grow the number of buckets on the fly.
>> Growing the number of buckets only becomes difficult once you have
>> more than one batch.
>
> ...
>
> I was considering reversing the bits of the hash, because that's pretty
> much the simplest solution. But I think you're right it might actually
> work like this:
>
> * Are more batches needed?
>
>   (yes) => just use nbuckets = my_log2(work_mem / tuple_size)
>
>   (no) => go ahead, until processing all tuples or hitting work_mem
>
>     (work_mem) => meh, use the same nbuckets as above
>
>     (all tuples) => compute the optimal nbuckets / resize
>
> But I need to think about this a bit. So far it seems to me there's no
> way additional batches might benefit from increasing nbuckets further.

I think this is a simple and solid solution, solving the batchno
computation issues quite nicely.

Attached is the v10 patch (bare, and combined with the dense
allocation), which does this:

1) When we know we'll need batching, buckets are sized for a full
   work_mem (using the estimated tuple width, etc.).

2) Without batching, we estimate the 'right number of buckets' for the
   estimated number of tuples, and keep track of the optimal number as
   tuples are added to the hash table.

   - If we discover we need to start batching, we keep the current
     optimal value (which should be the same as the max number of
     buckets) and don't mess with it anymore (making it possible to
     compute batch IDs just like before).

   - Also, on the first rebatch (nbatch=1 => nbatch=2) the hash table is
     resized as part of the rebatch.

   - If the hash build completes without batching, we do the resize.

I believe the patch is pretty much perfect. I plan to do more thorough
testing on a wide range of queries in the next few days.
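The sizing rule sketched above can be written out in a few lines. A toy
Python model (purely illustrative - the constants and helper names are
assumptions, loosely following the my_log2-style rounding used in
nodeHash.c, not the actual patch code):

```python
def next_pow2(n):
    """Smallest power of two >= n (my_log2-style rounding)."""
    p = 1
    while p < n:
        p <<= 1
    return p


def choose_nbuckets(ntuples, tuple_width, work_mem_bytes, ntup_per_bucket=10):
    """Illustrative bucket sizing following the two cases above.

    When the estimated hash table exceeds work_mem, batching will be
    needed, so the bucket array is sized for what one work_mem-full of
    tuples can hold. Otherwise the estimate targets roughly
    ntup_per_bucket tuples per bucket.
    """
    hash_bytes = ntuples * tuple_width
    if hash_bytes > work_mem_bytes:
        # Batching expected: buckets for one work_mem-full of tuples.
        return next_pow2(work_mem_bytes // (tuple_width * ntup_per_bucket))
    # Single batch expected: size for the estimated tuple count.
    return next_pow2(ntuples // ntup_per_bucket or 1)
```

Keeping nbuckets a power of two is what makes the batchno computation
from hash bits continue to work unchanged when the table is later
resized.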
I also removed the 'enable_hash_resize' GUC, because it would be more
complex to implement it properly after doing the resize as part of the
rebatch etc. So either it would make the patch more complex, or it
wouldn't do what the name promises.

regards
Tomas

diff --git a/src/backend/commands/explain.c b/src/backend/commands/explain.c
index 781a736..ee3fffb 100644
--- a/src/backend/commands/explain.c
+++ b/src/backend/commands/explain.c
@@ -1900,18 +1900,20 @@ show_hash_info(HashState *hashstate, ExplainState *es)
 	if (es->format != EXPLAIN_FORMAT_TEXT)
 	{
 		ExplainPropertyLong("Hash Buckets", hashtable->nbuckets, es);
+		ExplainPropertyLong("Original Hash Buckets",
+							hashtable->nbuckets_original, es);
 		ExplainPropertyLong("Hash Batches", hashtable->nbatch, es);
 		ExplainPropertyLong("Original Hash Batches",
 							hashtable->nbatch_original, es);
 		ExplainPropertyLong("Peak Memory Usage", spacePeakKb, es);
 	}
-	else if (hashtable->nbatch_original != hashtable->nbatch)
+	else if ((hashtable->nbatch_original != hashtable->nbatch) ||
+			 (hashtable->nbuckets_original != hashtable->nbuckets))
 	{
 		appendStringInfoSpaces(es->str, es->indent * 2);
 		appendStringInfo(es->str,
-						 "Buckets: %d  Batches: %d (originally %d)  Memory Usage: %ldkB\n",
-						 hashtable->nbuckets, hashtable->nbatch,
-						 hashtable->nbatch_original, spacePeakKb);
+						 "Buckets: %d (originally %d)  Batches: %d (originally %d)  Memory Usage: %ldkB\n",
+						 hashtable->nbuckets, hashtable->nbuckets_original,
+						 hashtable->nbatch, hashtable->nbatch_original, spacePeakKb);
 	}
 	else
 	{
diff --git a/src/backend/executor/nodeHash.c b/src/backend/executor/nodeHash.c
index 589b2f1..a5d4812 100644
--- a/src/backend/executor/nodeHash.c
+++ b/src/backend/executor/nodeHash.c
@@ -37,8 +37,8 @@
 #include "utils/lsyscache.h"
 #include "utils/syscache.h"
 
-
 static void ExecHashIncreaseNumBatches(HashJoinTable hashtable);
+static void ExecHashIncreaseNumBuckets(HashJoinTable hashtable);
 static void ExecHashBuildSkewHash(HashJoinTable hashtable, Hash *node,
 					int mcvsToUse);
 static void ExecHashSkewTableInsert(HashJoinTable hashtable,
@@ -47,6 +47,10 @@ static void ExecHashSkewTableInsert(HashJoinTable hashtable,
 					int bucketNumber);
 static void ExecHashRemoveNextSkewBucket(HashJoinTable hashtable);
 
+static char *chunk_alloc(HashJoinTable hashtable, int tupleSize);
+
+/* Memory needed for optimal number of buckets. */
+#define BUCKETS_SPACE(htab)	((htab)->nbuckets_optimal * sizeof(HashJoinTuple))
 
 /*
  * ExecHash
@@ -116,6 +120,7 @@ MultiExecHash(HashState *node)
 			/* It's a skew tuple, so put it into that hash table */
 			ExecHashSkewTableInsert(hashtable, slot, hashvalue,
 									bucketNumber);
+			hashtable->skewTuples += 1;
 		}
 		else
 		{
@@ -126,10 +131,30 @@
Re: [HACKERS] 9.5: Better memory accounting, towards memory-bounded HashAgg
On 10.8.2014 22:50, Jeff Davis wrote:
> On Fri, 2014-08-08 at 01:16 -0700, Jeff Davis wrote:
>> Either way, it's better to be conservative. Attached is a version of
>> the patch with opt-in memory usage tracking. Child contexts inherit
>> the setting from their parent.
>
> There was a problem with the previous patch: the parent was unlinked
> before the delete_context method was called, which meant that the
> parent's memory was not being decremented.
>
> Attached is a fix. It would be simpler to just reorder the operations
> in MemoryContextDelete, but there is a comment warning against that.
> So, I pass the old parent as an argument to the delete_context method.

Hi,

I did a quick check of the patch, mostly because I wasn't sure whether it
allows accounting for a selected subtree only (e.g. aggcontext and its
children). And yes, it does exactly that, which is great.

Also, the code seems fine to me, except maybe for this piece in
AllocSetDelete:

    /*
     * Parent is already unlinked from context, so can't use
     * update_allocation().
     */
    while (parent != NULL)
    {
        parent->total_allocated -= context->total_allocated;
        parent = parent->parent;
    }

I believe this should check parent->track_mem, just like
update_allocation(), because as it stands it walks all the parent
contexts up to TopMemoryContext. It does not really break anything
(because parents without tracking enabled don't report allocated space),
but for plans creating/destroying a lot of contexts (say, a
GroupAggregate with a lot of groups) this might add unnecessary
overhead.

regards
Tomas
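To make the suggested check concrete, here is a toy model of the
accounting walk in Python (purely illustrative - the real code is C in
aset.c). It stops at the first context that did not opt in to tracking,
which is one possible reading of "just like update_allocation"; what
should happen when a non-tracking context sits between two tracking ones
is a separate design question.

```python
class MemoryContext:
    """Minimal stand-in for a memory context with opt-in accounting."""

    def __init__(self, parent=None, track_mem=False):
        self.parent = parent
        self.track_mem = track_mem
        self.total_allocated = 0


def update_allocation(context, delta):
    """Propagate an allocation change up the parent chain.

    Walk stops at the first context without track_mem set -- the check
    the quoted AllocSetDelete loop omits, which is why that loop walks
    all the way to TopMemoryContext.
    """
    ctx = context
    while ctx is not None and ctx.track_mem:
        ctx.total_allocated += delta
        ctx = ctx.parent
```

With this rule, deleting a deeply nested context under a mostly
non-tracking hierarchy touches only the few contexts that asked for
accounting, instead of every ancestor.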
Re: [HACKERS] replication commands and log_statements
On Thu, Aug 14, 2014 at 7:19 PM, Stephen Frost sfr...@snowman.net wrote:
> * Amit Kapila (amit.kapil...@gmail.com) wrote:
>> On Thu, Aug 14, 2014 at 5:56 AM, Stephen Frost sfr...@snowman.net wrote:
>>> Regarding this, I'm generally in the camp that says to just include
>>> it in 'all' and be done with it - for now.
>>
>> Okay, but if tomorrow someone wants to implement a patch to log
>> statements executed through SPI (statements inside functions), what
>> would your suggestion be? Could those also be logged under the 'all'
>> option, or would you suggest waiting for a proper auditing system?
>
> No, I'd suggest having a different option for it, as that could
> potentially be a huge change for people who are already doing 'all'.

Not only that; I think another important reason is that those represent a
separate set of functionality, and some users could be interested in
logging those statements alone, or along with other sets of commands.

> Adding the replication commands is extremely unlikely to cause
> individuals who are already logging 'all' any problems, as far as I can
> tell.

As 'all' is a superset of all commands, they would have been logged under
'all' anyway; the point is whether replication commands deserve a
separate log level parameter. To me, it appears that they do, even though
the number of commands in this category is small today - it can grow
tomorrow. Also, if tomorrow there is a consensus on how to extend the
log_statement parameter, it is quite likely that someone will come up
with a patch to treat replication commands as a separate log level.

I think ideally it would have been better to log replication commands
under a separate log level, but as there is still no consensus on
extending log_statement and nobody is willing to pursue it, it seems okay
to go ahead and log these under the 'all' level.

With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com
Re: [HACKERS] 9.5: Better memory accounting, towards memory-bounded HashAgg
Tomas Vondra t...@fuzzy.cz writes:
> I believe this should check parent->track_mem, just like
> update_allocation, because this way it walks all the parent contexts up
> to TopMemoryContext.

TBH, I don't think this opt-in tracking business has been thought through
even a little bit; for example, what happens if there is a
declared-to-not-track context in between levels that are declared to
track? And it certainly has not been proven that the extra complexity
actually saves cycles rather than the reverse.

			regards, tom lane
Re: [HACKERS] 9.5: Better memory accounting, towards memory-bounded HashAgg
On 16.8.2014 20:00, Tom Lane wrote:
> Tomas Vondra t...@fuzzy.cz writes:
>> I believe this should check parent->track_mem, just like
>> update_allocation, because this way it walks all the parent contexts
>> up to TopMemoryContext.
>
> TBH, I don't think this opt-in tracking business has been thought
> through even a little bit; for example, what happens if there is a
> declared-to-not-track context in between levels that are declared to
> track? And it certainly has not been proven that the extra complexity
> actually saves cycles rather than the reverse.

AFAIK there's no other built-in accounting mechanism at the moment.
That's one of the reasons why some of the nodes had to invent their own
custom accounting - hashjoin is an example of that.

We do have MemoryContextStats, which essentially walks the memory
context hierarchy and collects all the important information. While
working on the hashjoin patch, I was thinking that maybe we could create
a MemoryContextStats clone, computing the accounting info on request.

So I did an experiment, writing this new MemoryContextStats clone and
rewriting nodeHash.c to use it instead of the custom tracking. The
impact was consistently ~2-3% for the hashjoin queries I measured (e.g.
9100 ms instead of 8900 ms, etc.).

I then tried the same with the hashjoin patch applied. With the built-in
accounting, the queries now run ~2.5x faster - around 3500 ms. With the
MemoryContextStats clone, it takes a whopping 18000 ms. I don't know
what causes such an increase (I assume I use the accounting info within
a loop or something). The point is that with Jeff's patch it goes down
to 3500 ms, with pretty much zero overhead.

I'll check why the patched hashjoin is so much slower with the clone,
but even if I can optimize it somehow, I don't think I'll get it to zero
overhead. So I'd say this accounting actually does save cycles. Of
course, that does not prove it's a well-thought-out design, although it
seems sensible to me.
It's the least complex accounting implementation I can think of. For
example, I don't think being able to create nonsensical hierarchies is a
major flaw. We certainly should not do that in our own code, and we
should take reasonable precautions to prevent it in user code (for
example by inheriting the tracking flag in MemoryContextCreate, as Jeff
does). Can this be seen as a foot-gun? Maybe, but we hand ammunition to
users on a daily basis - for example, they can switch to
TopMemoryContext, and that does not make memory contexts bad.

But maybe the inheritance really is not necessary - maybe it would be
enough to track this per-context, and then just walk through the
contexts and collect the totals. My observation is that most of the time
is actually spent walking through blocks and freelists. The issues with
nonsensical hierarchies would then disappear, and the possible overhead
of updating all the parent contexts would go away as well (though I'm
not sure that overhead is even measurable).

regards
Tomas