Re: Default setting for enable_hashagg_disk
On Mon, Jul 27, 2020 at 5:55 PM Peter Geoghegan wrote:
> Pushed the hashagg_avoid_disk_plan patch -- thanks!

Pushed the hash_mem_multiplier patch as well -- thanks again!

As I've said before, I am not totally opposed to adding a true escape hatch. That has not proven truly necessary just yet. For now, my working assumption is that the problem on the table has been addressed.

-- 
Peter Geoghegan
Re: Default setting for enable_hashagg_disk
On Sat, 2020-07-25 at 17:52 -0700, Peter Geoghegan wrote:
> BTW, your HLL patch ameliorates the problem with my extreme "sorted
> vs random input" test case from this morning [1] (the thing that I
> just discussed with Tomas). Without the HLL patch the sorted case had
> 2424 batches. With the HLL patch it has 20. That at least seems like
> a notable improvement.

Committed.

Though I did notice some overhead for spilled-but-still-in-memory cases due to addHyperLogLog() itself. It seems that it can be mostly eliminated with [1], though I'll wait to see if there's an objection, because that would affect other users of HLL.

Regards,
	Jeff Davis

[1] https://www.postgresql.org/message-id/17068336d300fab76dd6131cbe1996df450dde38.ca...@j-davis.com
Re: Default setting for enable_hashagg_disk
On Mon, Jul 27, 2020 at 5:10 PM Jeff Davis wrote:
> I noticed that one of the conditionals, "cheapest_total_path != NULL",
> was already redundant with the outer conditional before your patch. I
> guess that was just a mistake which your patch corrects along the way?

Makes sense.

> Anyway, the patch looks good to me. We can have a separate discussion
> about pessimizing the costing, if necessary.

Pushed the hashagg_avoid_disk_plan patch -- thanks!

-- 
Peter Geoghegan
Re: Default setting for enable_hashagg_disk
On Mon, 2020-07-27 at 11:30 -0700, Peter Geoghegan wrote:
> The v4-0001-Remove-hashagg_avoid_disk_plan-GUC.patch changes are
> surprisingly complicated. It would be nice if you could take a look at
> that aspect (or confirm that it's included in your review).

I noticed that one of the conditionals, "cheapest_total_path != NULL", was already redundant with the outer conditional before your patch. I guess that was just a mistake which your patch corrects along the way?

Anyway, the patch looks good to me. We can have a separate discussion about pessimizing the costing, if necessary.

Regards,
	Jeff Davis
Re: Default setting for enable_hashagg_disk
On Mon, Jul 27, 2020 at 12:52 PM Alvaro Herrera wrote:
> On 2020-Jul-27, Peter Geoghegan wrote:
> > The v4-0001-Remove-hashagg_avoid_disk_plan-GUC.patch changes are
> > surprisingly complicated. It would be nice if you could take a look at
> > that aspect (or confirm that it's included in your review).
>
> I think you mean "it replaces surprisingly complicated code with
> straightforward code". Right? Because in the previous code, there was
> a lot of effort going into deciding whether the path needed to be
> generated; the new code just generates the path always.

Yes, that's what I meant. It's a bit tricky. For example, I have removed a redundant "cheapest_total_path != NULL" test in create_partial_grouping_paths() (two, actually). But these two tests were always redundant. I have to wonder if I missed the point, though it seems likely that that was just an accident: accretions of code over time made the code work like that; nothing more.

-- 
Peter Geoghegan
Re: Default setting for enable_hashagg_disk
On 2020-Jul-27, Peter Geoghegan wrote:
> The v4-0001-Remove-hashagg_avoid_disk_plan-GUC.patch changes are
> surprisingly complicated. It would be nice if you could take a look at
> that aspect (or confirm that it's included in your review).

I think you mean "it replaces surprisingly complicated code with straightforward code". Right? Because in the previous code, there was a lot of effort going into deciding whether the path needed to be generated; the new code just generates the path always.

Similarly, the code to decide allow_hash in create_distinct_paths(), which used to be nontrivial, could (if you wanted) be simplified down to a single boolean condition. Previously, it was nontrivial only because it needed to consider memory usage -- not anymore.

But maybe you're talking about something more subtle that I'm just too blind to see.

-- 
Álvaro Herrera		https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Re: Default setting for enable_hashagg_disk
On Mon, Jul 27, 2020 at 11:24 AM Alvaro Herrera wrote:
> > Are you proposing that I just put the prototype in miscadmin.h, while
> > leaving the implementation where it is (in nodeHash.c)?
>
> Yes, that's in the part of my reply you didn't quote:
>
> : It remains strange to have the function in the executor
> : implementation, but I don't offhand see a better place, so maybe it's
> : okay where it is.

Got it. I tried putting the prototype in miscadmin.h, and I now agree that that's the best way to do it -- that's how I do it in the attached revision. No other changes.

The v4-0001-Remove-hashagg_avoid_disk_plan-GUC.patch changes are surprisingly complicated. It would be nice if you could take a look at that aspect (or confirm that it's included in your review).

-- 
Peter Geoghegan

Attachments:
	v4-0001-Remove-hashagg_avoid_disk_plan-GUC.patch
	v4-0002-Add-hash_mem_multiplier-GUC.patch
Re: Default setting for enable_hashagg_disk
On 2020-Jul-27, Peter Geoghegan wrote:
> On Mon, Jul 27, 2020 at 10:30 AM Alvaro Herrera wrote:
> > On 2020-Jul-23, Peter Geoghegan wrote:
> > I notice you put the prototype for get_hash_mem in nodeHash.h. This
> > would be fine if not for the fact that the optimizer needs to call
> > the function too, which means the optimizer now has to include
> > executor headers -- not a great thing. I'd move the prototype
> > elsewhere to avoid this, and I think miscadmin.h is a decent place
> > for the prototype, next to work_mem and m_w_m.
>
> The location of get_hash_mem() is awkward,

Yes.

> but there is no obvious alternative.

Agreed.

> Are you proposing that I just put the prototype in miscadmin.h, while
> leaving the implementation where it is (in nodeHash.c)?

Yes, that's in the part of my reply you didn't quote:

: It remains strange to have the function in the executor
: implementation, but I don't offhand see a better place, so maybe it's
: okay where it is.

> [...] moving the implementation of get_hash_mem() to either of those
> two files seems worse to me.

Sure.

-- 
Álvaro Herrera		https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Re: Default setting for enable_hashagg_disk
On Mon, Jul 27, 2020 at 10:30 AM Alvaro Herrera wrote:
> On 2020-Jul-23, Peter Geoghegan wrote:
> I notice you put the prototype for get_hash_mem in nodeHash.h. This
> would be fine if not for the fact that the optimizer needs to call
> the function too, which means the optimizer now has to include
> executor headers -- not a great thing. I'd move the prototype
> elsewhere to avoid this, and I think miscadmin.h is a decent place
> for the prototype, next to work_mem and m_w_m.

The location of get_hash_mem() is awkward, but there is no obvious alternative. Are you proposing that I just put the prototype in miscadmin.h, while leaving the implementation where it is (in nodeHash.c)?

Maybe that sounds like an odd question, but bear in mind that the natural place to put the implementation of a function declared in miscadmin.h is either utils/init/postinit.c or utils/init/miscinit.c -- moving the implementation of get_hash_mem() to either of those two files seems worse to me.

That said, there is an existing oddball case in miscadmin.h, right at the end -- the two functions whose implementation is in access/transam/xlog.c. So I can see an argument for adding another oddball case (i.e. moving the prototype to the end of miscadmin.h without changing anything else).

> Other than that admittedly trivial complaint, I found nothing to
> complain about in this patch.

Great. Thanks for the review.

My intention is to commit hash_mem_multiplier on Wednesday. We need to move on from this, and get the release out the door.

-- 
Peter Geoghegan
Re: Default setting for enable_hashagg_disk
On 2020-Jul-23, Peter Geoghegan wrote:
> Attached is v3 of the hash_mem_multiplier patch series, which now has
> a preparatory patch that removes hashagg_avoid_disk_plan.

I notice you put the prototype for get_hash_mem in nodeHash.h. This would be fine if not for the fact that the optimizer needs to call the function too, which means the optimizer now has to include executor headers -- not a great thing. I'd move the prototype elsewhere to avoid this, and I think miscadmin.h is a decent place for it, next to work_mem and m_w_m. It remains strange to have the function in the executor implementation, but I don't offhand see a better place, so maybe it's okay where it is.

Other than that admittedly trivial complaint, I found nothing to complain about in this patch.

-- 
Álvaro Herrera		https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Re: Default setting for enable_hashagg_disk
On Sun, Jul 26, 2020 at 11:34 AM Tomas Vondra wrote:
> That's 1.6GB, if I read it right. Which is more than 200MB ;-)

Sigh. That solves that "mystery": the behavior that my sorted vs random example exhibited is a known limitation in hash aggs that spill (and an acceptable one). The memory usage is reported accurately by EXPLAIN ANALYZE.

-- 
Peter Geoghegan
Re: Default setting for enable_hashagg_disk
On Sat, Jul 25, 2020 at 05:13:00PM -0700, Peter Geoghegan wrote:
> On Sat, Jul 25, 2020 at 5:05 PM Tomas Vondra wrote:
> > I'm not sure what you mean by "reported memory usage doesn't reflect
> > the space used for transition state"? Surely it does include that,
> > we've built the memory accounting stuff pretty much exactly to do
> > that.
> >
> > I think it's pretty clear what's happening - in the sorted case
> > there's only a single group getting new values at any moment, so
> > when we decide to spill we'll only add rows to that group and
> > everything else will be spilled to disk.
>
> Right.
>
> > In the unsorted case however we manage to initialize all groups in
> > the hash table, but at that point the groups are tiny and fit into
> > work_mem. As we process more and more data the groups grow, but we
> > can't evict them - at the moment we don't have that capability. So
> > we end up processing everything in memory, but significantly
> > exceeding work_mem.
>
> work_mem was set to 200MB, which is more than the reported "Peak
> Memory Usage: 1605334kB". So either the random case significantly

That's 1.6GB, if I read it right. Which is more than 200MB ;-)

> exceeds work_mem and the "Peak Memory Usage" accounting is wrong
> (because it doesn't report this excess), or the random case really
> doesn't exceed work_mem but has a surprising advantage over the
> sorted case.

-- 
Tomas Vondra	http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Re: Default setting for enable_hashagg_disk
On Sat, Jul 25, 2020 at 5:31 PM Peter Geoghegan wrote:
> I'm glad that this better principled approach is possible. It's hard
> to judge how much of a problem this really is, though. We'll need to
> think about this aspect some more.

BTW, your HLL patch ameliorates the problem with my extreme "sorted vs random input" test case from this morning [1] (the thing that I just discussed with Tomas). Without the HLL patch the sorted case had 2424 batches. With the HLL patch it has 20. That at least seems like a notable improvement.

[1] https://postgr.es/m/CAH2-Wz=ur7MQKpaUZJP=Adtg0TPMx5M_WoNE=ke2vUU=amd...@mail.gmail.com

-- 
Peter Geoghegan
Re: Default setting for enable_hashagg_disk
On Sat, Jul 25, 2020 at 4:56 PM Jeff Davis wrote:
> I wrote a quick patch to use HyperLogLog to estimate the number of
> groups contained in a spill file. It seems to reduce the
> overpartitioning effect, and is a more principled approach than what I
> was doing before.

This pretty much fixes the issue that I observed with overpartitioning. At least in the sense that the number of partitions grows more predictably -- even when the number of planned partitions is reduced, the change in the number of batches seems smooth-ish. It "looks nice".

> It does seem to hurt the runtime slightly when spilling to disk in some
> cases. I haven't narrowed down whether this is because we end up
> recursing multiple times, or if it's just more efficient to
> overpartition, or if the cost of doing the HLL itself is significant.

I'm glad that this better principled approach is possible. It's hard to judge how much of a problem this really is, though. We'll need to think about this aspect some more.

Thanks
-- 
Peter Geoghegan
Re: Default setting for enable_hashagg_disk
On Sat, Jul 25, 2020 at 5:05 PM Tomas Vondra wrote:
> I'm not sure what you mean by "reported memory usage doesn't reflect the
> space used for transition state"? Surely it does include that, we've
> built the memory accounting stuff pretty much exactly to do that.
>
> I think it's pretty clear what's happening - in the sorted case there's
> only a single group getting new values at any moment, so when we decide
> to spill we'll only add rows to that group and everything else will be
> spilled to disk.

Right.

> In the unsorted case however we manage to initialize all groups in the
> hash table, but at that point the groups are tiny and fit into work_mem.
> As we process more and more data the groups grow, but we can't evict
> them - at the moment we don't have that capability. So we end up
> processing everything in memory, but significantly exceeding work_mem.

work_mem was set to 200MB, which is more than the reported "Peak Memory Usage: 1605334kB". So either the random case significantly exceeds work_mem and the "Peak Memory Usage" accounting is wrong (because it doesn't report this excess), or the random case really doesn't exceed work_mem but has a surprising advantage over the sorted case.

-- 
Peter Geoghegan
Re: Default setting for enable_hashagg_disk
On Sat, Jul 25, 2020 at 10:07:37AM -0700, Peter Geoghegan wrote:
> On Sat, Jul 25, 2020 at 9:39 AM Peter Geoghegan wrote:
> > "Peak Memory Usage: 1605334kB"
> >
> > Hash agg avoids spilling entirely (so the planner gets it right this
> > time around). It even uses notably less memory.
>
> I guess that this is because the reported memory usage doesn't reflect
> the space used for transition state, which is presumably most of the
> total -- array_agg() is used in the query.

I'm not sure what you mean by "reported memory usage doesn't reflect the space used for transition state"? Surely it does include that, we've built the memory accounting stuff pretty much exactly to do that.

I think it's pretty clear what's happening - in the sorted case there's only a single group getting new values at any moment, so when we decide to spill we'll only add rows to that group and everything else will be spilled to disk.

In the unsorted case however we manage to initialize all groups in the hash table, but at that point the groups are tiny and fit into work_mem. As we process more and more data the groups grow, but we can't evict them - at the moment we don't have that capability. So we end up processing everything in memory, but significantly exceeding work_mem.

FWIW all my tests are done on the same TPC-H data set clustered by l_shipdate (so probably random with respect to other columns).

regards

-- 
Tomas Vondra	http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Re: Default setting for enable_hashagg_disk
On Sat, 2020-07-25 at 13:27 -0700, Peter Geoghegan wrote:
> > It's not clear to me that overpartitioning is a real problem in this
> > case -- but I think the fact that it's causing confusion is enough
> > reason to see if we can fix it.
>
> I'm not sure about that either.
>
> FWIW I notice that when I reduce work_mem a little further (to 3MB)
> with the same query, the number of partitions is still 128, while the
> number of run time batches is 16,512 (an increase from 11,456 at 6MB
> work_mem). I notice that 16512/128 is 129, which hints at the nature
> of what's going on with the recursion. I guess it would be ideal if
> the growth in batches was more gradual as I subtract memory.

I wrote a quick patch to use HyperLogLog to estimate the number of groups contained in a spill file. It seems to reduce the overpartitioning effect, and is a more principled approach than what I was doing before.

It does seem to hurt the runtime slightly when spilling to disk in some cases. I haven't narrowed down whether this is because we end up recursing multiple times, or if it's just more efficient to overpartition, or if the cost of doing the HLL itself is significant.
Regards,
	Jeff Davis

From 424f611f442e36a91747fd39cd28acba42eb958b Mon Sep 17 00:00:00 2001
From: Jeff Davis
Date: Sat, 25 Jul 2020 14:29:59 -0700
Subject: [PATCH] HLL

---
 src/backend/executor/nodeAgg.c | 64 ++
 src/include/executor/nodeAgg.h |  2 +-
 2 files changed, 43 insertions(+), 23 deletions(-)

diff --git a/src/backend/executor/nodeAgg.c b/src/backend/executor/nodeAgg.c
index b79c845a6b7..25771cb4869 100644
--- a/src/backend/executor/nodeAgg.c
+++ b/src/backend/executor/nodeAgg.c
@@ -245,9 +245,11 @@
 #include "catalog/pg_aggregate.h"
 #include "catalog/pg_proc.h"
 #include "catalog/pg_type.h"
+#include "common/hashfn.h"
 #include "executor/execExpr.h"
 #include "executor/executor.h"
 #include "executor/nodeAgg.h"
+#include "lib/hyperloglog.h"
 #include "miscadmin.h"
 #include "nodes/makefuncs.h"
 #include "nodes/nodeFuncs.h"
@@ -295,6 +297,14 @@
 #define HASHAGG_READ_BUFFER_SIZE BLCKSZ
 #define HASHAGG_WRITE_BUFFER_SIZE BLCKSZ
 
+/*
+ * HyperLogLog is used for estimating the cardinality of the spilled tuples in
+ * a given partition. 5 bits corresponds to a size of about 32 bytes and a
+ * worst-case error of around 18%. That's effective enough to choose a
+ * reasonable number of partitions when recursing.
+ */
+#define HASHAGG_HLL_BIT_WIDTH 5
+
 /*
  * Estimate chunk overhead as a constant 16 bytes. XXX: should this be
  * improved?
@@ -339,6 +349,7 @@ typedef struct HashAggSpill
 	int64	   *ntuples;		/* number of tuples in each partition */
 	uint32		mask;			/* mask to find partition from hash value */
 	int			shift;			/* after masking, shift by this amount */
+	hyperLogLogState *hll_card;	/* cardinality estimate for contents */
 } HashAggSpill;
 
 /*
@@ -357,6 +368,7 @@ typedef struct HashAggBatch
 	LogicalTapeSet *tapeset;	/* borrowed reference to tape set */
 	int			input_tapenum;	/* input partition tape */
 	int64		input_tuples;	/* number of tuples in this batch */
+	double		input_card;		/* estimated group cardinality */
 } HashAggBatch;
 
 /* used to find referenced colnos */
@@ -409,7 +421,7 @@ static void hashagg_recompile_expressions(AggState *aggstate, bool minslot,
 static long hash_choose_num_buckets(double hashentrysize,
 									long estimated_nbuckets,
 									Size memory);
-static int	hash_choose_num_partitions(uint64 input_groups,
+static int	hash_choose_num_partitions(double input_groups,
 									   double hashentrysize,
 									   int used_bits,
 									   int *log2_npartittions);
@@ -429,10 +441,11 @@ static void hashagg_finish_initial_spills(AggState *aggstate);
 static void hashagg_reset_spill_state(AggState *aggstate);
 static HashAggBatch *hashagg_batch_new(LogicalTapeSet *tapeset,
 									   int input_tapenum, int setno,
-									   int64 input_tuples, int used_bits);
+									   int64 input_tuples, double input_card,
+									   int used_bits);
 static MinimalTuple hashagg_batch_read(HashAggBatch *batch, uint32 *hashp);
 static void hashagg_spill_init(HashAggSpill *spill, HashTapeInfo *tapeinfo,
-							   int used_bits, uint64 input_tuples,
+							   int used_bits, double input_groups,
 							   double hashentrysize);
 static Size hashagg_spill_tuple(AggState *aggstate, HashAggSpill *spill,
 								TupleTableSlot *slot, uint32 hash);
@@ -1775,7 +1788,7 @@ hashagg_recompile_expressions(AggState *aggstate, bool minslot, bool nullcheck)
  * substantially larger than the initial value.
  */
 void
-hash_agg_set_limits(double hashentrysize, uint64 input_groups, int used_bits,
+hash_agg_set_limits(double hashentrysize, double input_groups, int used_bits,
 					Size *mem_limit, uint64 *ngroups_limit,
 					int *num_partitions)
 {
@@ -1967,7 +1980,7 @@ hash_choose_num_buckets(double hashentrysize, long ngroups, Size memory)
  * *log2_npartitions to the log2(
Re: Default setting for enable_hashagg_disk
On Sat, Jul 25, 2020 at 1:10 PM Jeff Davis wrote:
> On Sat, 2020-07-25 at 11:05 -0700, Peter Geoghegan wrote:
> > What worries me a bit is the sharp discontinuities when spilling with
> > significantly less work_mem than the "optimal" amount. For example,
> > with Tomas' TPC-H query (against my smaller TPC-H dataset), I find
> > that setting work_mem to 6MB looks like this:
> > ...
> > Planned Partitions: 128  Peak Memory Usage: 6161kB
> > Disk Usage: 2478080kB  HashAgg Batches: 128
> > ...
> > Planned Partitions: 128  Peak Memory Usage: 5393kB
> > Disk Usage: 2482152kB  HashAgg Batches: 11456
>
> It's not clear to me that overpartitioning is a real problem in this
> case -- but I think the fact that it's causing confusion is enough
> reason to see if we can fix it.

I'm not sure about that either.

FWIW I notice that when I reduce work_mem a little further (to 3MB) with the same query, the number of partitions is still 128, while the number of run time batches is 16,512 (an increase from 11,456 at 6MB work_mem). I notice that 16512/128 is 129, which hints at the nature of what's going on with the recursion. I guess it would be ideal if the growth in batches was more gradual as I subtract memory.

-- 
Peter Geoghegan
Re: Default setting for enable_hashagg_disk
On Sat, 2020-07-25 at 11:05 -0700, Peter Geoghegan wrote:
> What worries me a bit is the sharp discontinuities when spilling with
> significantly less work_mem than the "optimal" amount. For example,
> with Tomas' TPC-H query (against my smaller TPC-H dataset), I find
> that setting work_mem to 6MB looks like this:
> ...
> Planned Partitions: 128  Peak Memory Usage: 6161kB
> Disk Usage: 2478080kB  HashAgg Batches: 128
> ...
> Planned Partitions: 128  Peak Memory Usage: 5393kB
> Disk Usage: 2482152kB  HashAgg Batches: 11456
> ...
> My guess is that this is because the recursive hash aggregation
> misbehaves in a self-similar fashion once a certain tipping point has
> been reached.

It looks like it might be fairly easy to use HyperLogLog as an estimator for the recursive step. That should reduce the overpartitioning, which I believe is the cause of this discontinuity.

It's not clear to me that overpartitioning is a real problem in this case -- but I think the fact that it's causing confusion is enough reason to see if we can fix it.

Regards,
	Jeff Davis
Re: Default setting for enable_hashagg_disk
On Sat, Jul 25, 2020 at 10:23 AM Jeff Davis wrote:
> There's also another effect at work that can cause the total number of
> batches to be higher for larger work_mem values: when we do recurse, we
> again need to estimate the number of partitions needed. Right now, we
> overestimate the number of partitions needed (to be conservative),
> which leads to a wider fan-out and lots of tiny partitions, and
> therefore more batches.

What worries me a bit is the sharp discontinuities when spilling with significantly less work_mem than the "optimal" amount. For example, with Tomas' TPC-H query (against my smaller TPC-H dataset), I find that setting work_mem to 6MB looks like this:

  ->  HashAggregate  (cost=2700529.47..3020654.22 rows=1815500 width=40) (actual time=21039.788..32278.703 rows=200 loops=1)
        Output: lineitem.l_partkey, (0.2 * avg(lineitem.l_quantity))
        Group Key: lineitem.l_partkey
        Planned Partitions: 128  Peak Memory Usage: 6161kB
        Disk Usage: 2478080kB  HashAgg Batches: 128

(And we have a sensible looking number of batches that matches the number of planned partitions with higher work_mem settings, too.)

However, if I set work_mem to 5MB (or less), it looks like this:

  ->  HashAggregate  (cost=2700529.47..3020654.22 rows=1815500 width=40) (actual time=20849.490..37027.533 rows=200 loops=1)
        Output: lineitem.l_partkey, (0.2 * avg(lineitem.l_quantity))
        Group Key: lineitem.l_partkey
        Planned Partitions: 128  Peak Memory Usage: 5393kB
        Disk Usage: 2482152kB  HashAgg Batches: 11456

So the number of planned partitions is still 128, but the number of batches explodes to 11,456 all at once. My guess is that this is because the recursive hash aggregation misbehaves in a self-similar fashion once a certain tipping point has been reached. I expect that the exact nature of that tipping point is very complicated, and generally dependent on the dataset, clustering, etc. But I don't think that this kind of effect will be uncommon.

(FWIW this example requires ~620MB work_mem to complete without spilling at all -- so it's kind of extreme, though not quite as extreme as many of the similar test results from Tomas.)

-- 
Peter Geoghegan
Re: Default setting for enable_hashagg_disk
On Thu, 2020-07-23 at 19:33 -0700, Peter Geoghegan wrote:
> That does make it sound like the costs of the hash agg aren't being
> represented. I suppose it isn't clear if this is a costing issue
> because it isn't clear if the execution time performance itself is
> pathological or is instead something that must be accepted as the
> cost of spilling the hash agg in a general kind of way.

I have a feeling that this is mostly a costing problem. Sort uses its memory in two different phases:

1. when writing the sorted runs, it needs the memory to hold the run before sorting it, and only a single buffer for the output tape; and
2. when merging, it needs a lot of read buffers

But in HashAgg, it needs to hold all of the groups in memory *at the same time* as it needs a lot of output buffers (one for each partition). This doesn't matter a lot at high values of work_mem, because the buffers will only be 8MB at most.

I did attempt to cost this properly: hash_agg_set_limits() takes into account the memory the partitions will use, and the remaining memory is what's used in cost_agg(). But there's a lot of room for error in there.

If someone sees an obvious error in the costing, please let me know. Otherwise, I think it will just take some time to make it better reflect reality in a variety of cases. For v13, we will either need to live with it, or pessimize the costing for HashAgg until we get it right.

Many costing issues can deal with a lot of slop -- e.g. HashJoin vs MergeJoin -- because a small factor often doesn't make the difference between plans. But HashAgg and Sort are more competitive with each other, so the costing needs to be more precise.

Regards,
	Jeff Davis
Re: Default setting for enable_hashagg_disk
On Fri, 2020-07-24 at 10:40 +0200, Tomas Vondra wrote:
> FWIW one more suspicious thing that I forgot to mention is the
> behavior of the "planned partitions" depending on work_mem, which
> looks like this:
>
>     2MB   Planned Partitions: 64     HashAgg Batches: 4160
>     4MB   Planned Partitions: 128    HashAgg Batches: 16512
>     8MB   Planned Partitions: 256    HashAgg Batches: 21488
>    64MB   Planned Partitions: 32     HashAgg Batches: 2720
>   256MB   Planned Partitions: 8      HashAgg Batches: 8
>
> I'd expect the number of planned partitions to decrease (slowly) as
> work_mem increases, but it seems to increase initially. Seems a bit
> strange, but maybe it's expected.

The space for open-partition buffers is also limited to about 25% of memory. Each open partition takes BLCKSZ memory, so those numbers are exactly what I'd expect (64 * 8192 = 512kB).

There's also another effect at work that can cause the total number of batches to be higher for larger work_mem values: when we do recurse, we again need to estimate the number of partitions needed. Right now, we overestimate the number of partitions needed (to be conservative), which leads to a wider fan-out and lots of tiny partitions, and therefore more batches.

I think we can improve this by using something like a HyperLogLog on the hash values of the spilled tuples to get a better estimate for the number of groups (and therefore the number of partitions) that we need when we recurse, which would reduce the number of overall batches at higher work_mem settings. But I didn't get a chance to implement that yet.

Regards,
	Jeff Davis
Re: Default setting for enable_hashagg_disk
On Sat, Jul 25, 2020 at 9:39 AM Peter Geoghegan wrote:
> "Peak Memory Usage: 1605334kB"
>
> Hash agg avoids spilling entirely (so the planner gets it right this
> time around). It even uses notably less memory.

I guess that this is because the reported memory usage doesn't reflect the space used for transition state, which is presumably most of the total -- array_agg() is used in the query.

-- 
Peter Geoghegan
Re: Default setting for enable_hashagg_disk
On Fri, Jul 24, 2020 at 12:55 PM Peter Geoghegan wrote:
> Could that be caused by clustering in the data?
>
> If the input data is in totally random order then we have a good
> chance of never having to spill skewed "common" values. That is, we're
> bound to encounter common values before entering spill mode, and so
> those common values will continue to be usefully aggregated until
> we're done with the initial groups (i.e. until the in-memory hash
> table is cleared in order to process spilled input tuples). This is
> great because the common values get aggregated without ever spilling,
> and most of the work is done before we even begin with spilled tuples.
>
> If, on the other hand, the common values are concentrated together in
> the input...

I still don't know if that was a factor in your example, but I can clearly demonstrate that the clustering of the data can matter a lot to hash aggs in Postgres 13. I attach a contrived example where it makes a *huge* difference.

I find that the sorted version of the aggregate query takes significantly longer to finish, and has the following spill characteristics:

"Peak Memory Usage: 205086kB  Disk Usage: 2353920kB  HashAgg Batches: 2424"

Note that the planner doesn't expect any partitions here, but we still get 2424 batches -- so the planner seems to get it totally wrong. OTOH, the same query against a randomized version of the same data (no longer in sorted order, no clustering) works perfectly with the same work_mem (200MB):

"Peak Memory Usage: 1605334kB"

Hash agg avoids spilling entirely (so the planner gets it right this time around). It even uses notably less memory.

-- 
Peter Geoghegan

Attachment: test-agg-sorted.sql
Re: Default setting for enable_hashagg_disk
On Fri, 2020-07-24 at 21:16 +0200, Tomas Vondra wrote:
> Surely more recursive spilling should do more I/O, but the Disk Usage
> reported by explain analyze does not show anything like ...

I suspect that's because of disk reuse in logtape.c.

Regards,
	Jeff Davis
Re: Default setting for enable_hashagg_disk
On Fri, Jul 24, 2020 at 12:16 PM Tomas Vondra wrote:
> Maybe, but we're nowhere close to these limits. See this table which I
> posted earlier:
>
>    2MB    Planned Partitions:  64    HashAgg Batches:  4160
>    4MB    Planned Partitions: 128    HashAgg Batches: 16512
>    8MB    Planned Partitions: 256    HashAgg Batches: 21488
>   64MB    Planned Partitions:  32    HashAgg Batches:  2720
>  256MB    Planned Partitions:   8    HashAgg Batches:     8
>
> This is from the non-parallel runs on the i5 machine with 32GB data set,
> the first column is work_mem. We're nowhere near the 1024 limit, and the
> cardinality estimates are pretty good.
>
> OTOH the number of batches is much higher, so clearly there was some
> recursive spilling happening. What I find strange is that this grows
> with work_mem and only starts dropping after 64MB.

Could that be caused by clustering in the data?

If the input data is in totally random order then we have a good chance of never having to spill skewed "common" values. That is, we're bound to encounter common values before entering spill mode, and so those common values will continue to be usefully aggregated until we're done with the initial groups (i.e. until the in-memory hash table is cleared in order to process spilled input tuples). This is great because the common values get aggregated without ever spilling, and most of the work is done before we even begin with spilled tuples.

If, on the other hand, the common values are concentrated together in the input...

Assuming that I have this right, then I would also expect simply having more memory to ameliorate the problem. If you only have/need 4 or 8 partitions then you can fit a higher proportion of the total number of groups for the whole dataset in the hash table (at the point when you first enter spill mode). I think it follows that the "nailed" hash table entries/groupings will "better characterize" the dataset as a whole.

> Also, how could the amount of I/O be almost constant in all these cases?
> Surely more recursive spilling should do more I/O, but the Disk Usage
> reported by explain analyze does not show anything like ...

Not sure, but might that just be because of the fact that logtape.c can recycle disk space?

As I said in my last e-mail, it's pretty reasonable to assume that the vast majority of external sorts are one-pass. It follows that disk usage can be thought of as almost the same thing as total I/O for tuplesort. But the same heuristic isn't reasonable when thinking about hash agg. Hash agg might write out much less data than the total memory used for the equivalent "peak optimal nospill" hash agg case -- or much more. (Again, reiterating what I said in my last e-mail.)

-- Peter Geoghegan
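The input-ordering intuition can be checked with a toy model (a rough Python sketch with made-up group counts and table capacity -- not the actual nodeAgg.c spilling logic): a fixed-capacity "hash table" aggregates tuples for groups it already holds and spills every tuple of any other group once full. Shuffled skewed input spills far fewer tuples than input where the common values are clustered after the table has filled up with rare groups:

```python
import random

def spilled_tuples(tuples, capacity):
    """Toy model of hash agg spilling: groups that enter the in-memory
    table before it fills keep aggregating for free; every tuple of any
    other group is written out ("spilled")."""
    table = set()
    spilled = 0
    for group in tuples:
        if group in table:
            continue              # aggregated in place, no I/O
        if len(table) < capacity:
            table.add(group)      # group gets a "nailed" entry
        else:
            spilled += 1          # table full: tuple goes to a spill file
    return spilled

# Skewed data set (made-up sizes): 10 common groups with 1000 tuples
# each, plus 990 rare groups with 10 tuples each.
common = [g for g in range(10) for _ in range(1000)]
rare = [g for g in range(10, 1000) for _ in range(10)]

random.seed(42)
shuffled = common + rare
random.shuffle(shuffled)          # common values spread through the input
clustered = rare + common         # common values arrive after spill begins

capacity = 500                    # groups that fit in "work_mem"
print("shuffled input spills:", spilled_tuples(shuffled, capacity))
print("clustered input spills:", spilled_tuples(clustered, capacity))
```

In the clustered case all 10,000 common-value tuples spill because the rare groups have already claimed every table entry; in the shuffled case the common groups are encountered early, get nailed into the table, and never spill.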
Re: Default setting for enable_hashagg_disk
On Fri, Jul 24, 2020 at 11:03:54AM -0700, Peter Geoghegan wrote:
>On Fri, Jul 24, 2020 at 8:19 AM Robert Haas wrote:
>>This is all really good analysis, I think, but this seems like the key
>>finding. It seems like we don't really understand what's actually
>>getting written. Whether we use hash or sort doesn't seem like it
>>should have this kind of impact on how much data gets written, and
>>whether we use CP_SMALL_TLIST or project when needed doesn't seem like
>>it should matter like this either.
>
>Isn't this more or less the expected behavior in the event of
>partitions that are spilled recursively? The cases that Tomas tested
>were mostly cases where work_mem was tiny relative to the data being
>aggregated.
>
>The following is an extract from commit 1f39bce0215 showing some stuff
>added to the beginning of nodeAgg.c:
>
>+ * We also specify a min and max number of partitions per spill. Too few might
>+ * mean a lot of wasted I/O from repeated spilling of the same tuples. Too
>+ * many will result in lots of memory wasted buffering the spill files (which
>+ * could instead be spent on a larger hash table).
>+ */
>+#define HASHAGG_PARTITION_FACTOR 1.50
>+#define HASHAGG_MIN_PARTITIONS 4
>+#define HASHAGG_MAX_PARTITIONS 1024

Maybe, but we're nowhere close to these limits. See this table which I posted earlier:

   2MB    Planned Partitions:  64    HashAgg Batches:  4160
   4MB    Planned Partitions: 128    HashAgg Batches: 16512
   8MB    Planned Partitions: 256    HashAgg Batches: 21488
  64MB    Planned Partitions:  32    HashAgg Batches:  2720
 256MB    Planned Partitions:   8    HashAgg Batches:     8

This is from the non-parallel runs on the i5 machine with 32GB data set, the first column is work_mem. We're nowhere near the 1024 limit, and the cardinality estimates are pretty good.

OTOH the number of batches is much higher, so clearly there was some recursive spilling happening. What I find strange is that this grows with work_mem and only starts dropping after 64MB.

Also, how could the amount of I/O be almost constant in all these cases? Surely more recursive spilling should do more I/O, but the Disk Usage reported by explain analyze does not show anything like ...

regards

-- Tomas Vondra
http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Re: Default setting for enable_hashagg_disk
On Fri, Jul 24, 2020 at 1:40 AM Tomas Vondra wrote: > Maybe, not sure what exactly you think is pathological? The trouble is > hashagg has to spill input tuples but the memory used in no-spill case > represents aggregated groups, so I'm not sure how you could extrapolate > from that ... Yeah, but when hash agg enters spill mode it will continue to advance the transition states for groups already in the hash table, which could be quite a significant effect. The peak memory usage for an equivalent no-spill hash agg is therefore kind of related to the amount of I/O needed for spilling. I suppose that you mostly tested cases where memory was in very short supply, where that breaks down completely. Perhaps it wasn't helpful for me to bring that factor into this discussion -- it's not as if there is any doubt that hash agg is spilling a lot more here in any case. > Not sure, but I think we need to spill roughly as much as sort, so it > seems a bit strange that (a) we're spilling 2x as much data and yet the > cost is so much lower. ISTM that the amount of I/O that hash agg performs can vary *very* widely for the same data. This is mostly determined by work_mem, but there are second order effects. OTOH, the amount of I/O that a sort must do is practically fixed. You can quibble with that characterisation a bit because of multi-pass sorts, but not really -- multi-pass sorts are generally quite rare. I think that we need a more sophisticated cost model for this in cost_agg(). Maybe the "pages_written" calculation could be pessimized. However, it doesn't seem like this is precisely an issue with I/O costs. -- Peter Geoghegan
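The "multi-pass sorts are rare" point follows from merge arithmetic: quicksorted runs of roughly work_mem each are merged many at a time, so the input has to exceed roughly work_mem squared divided by the per-tape buffer size before a second merge pass is needed. A back-of-the-envelope model (assumed 256kB tape buffers and simplified accounting -- not tuplesort.c's actual numbers):

```python
import math

def merge_passes(total_bytes, work_mem_bytes, tape_buffer_bytes=256 * 1024):
    """Rough model of external sort merging (assumed parameters, not
    tuplesort.c's real accounting): runs of ~work_mem each are merged
    work_mem/tape_buffer ways per pass."""
    runs = math.ceil(total_bytes / work_mem_bytes)
    if runs <= 1:
        return 0                          # fits in memory, no merge needed
    merge_order = max(2, work_mem_bytes // tape_buffer_bytes)
    return math.ceil(math.log(runs, merge_order))

GB = 1024 ** 3
MB = 1024 ** 2
print(merge_passes(16 * GB, 64 * MB))     # 256 runs, 256-way merge
print(merge_passes(100 * GB, 64 * MB))    # 1600 runs: needs a second pass
```

Under these assumptions, 64MB of work_mem sorts about 16GB of input in a single merge pass; the number of passes then grows only logarithmically, which is why sort I/O is "practically fixed" over a huge range of work_mem settings.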
Re: Default setting for enable_hashagg_disk
On Fri, Jul 24, 2020 at 8:19 AM Robert Haas wrote:
> This is all really good analysis, I think, but this seems like the key
> finding. It seems like we don't really understand what's actually
> getting written. Whether we use hash or sort doesn't seem like it
> should have this kind of impact on how much data gets written, and
> whether we use CP_SMALL_TLIST or project when needed doesn't seem like
> it should matter like this either.

Isn't this more or less the expected behavior in the event of partitions that are spilled recursively? The cases that Tomas tested were mostly cases where work_mem was tiny relative to the data being aggregated.

The following is an extract from commit 1f39bce0215 showing some stuff added to the beginning of nodeAgg.c:

+ * We also specify a min and max number of partitions per spill. Too few might
+ * mean a lot of wasted I/O from repeated spilling of the same tuples. Too
+ * many will result in lots of memory wasted buffering the spill files (which
+ * could instead be spent on a larger hash table).
+ */
+#define HASHAGG_PARTITION_FACTOR 1.50
+#define HASHAGG_MIN_PARTITIONS 4
+#define HASHAGG_MAX_PARTITIONS 1024

-- Peter Geoghegan
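A rough sketch of how those constants might combine when choosing a partition count (a hypothetical simplification, not the exact logic of commit 1f39bce0215 -- the real code also accounts for things like per-partition spill-file buffer memory):

```python
HASHAGG_PARTITION_FACTOR = 1.50
HASHAGG_MIN_PARTITIONS = 4
HASHAGG_MAX_PARTITIONS = 1024

def choose_num_partitions(input_groups, group_mem_bytes, work_mem_bytes):
    """Simplified sketch: pick enough partitions that each partition's
    share of the groups should fit in work_mem, padded by
    HASHAGG_PARTITION_FACTOR, rounded up to a power of two, and clamped
    to the min/max."""
    mem_wanted = input_groups * group_mem_bytes
    target = (mem_wanted / work_mem_bytes) * HASHAGG_PARTITION_FACTOR
    npartitions = 1
    while npartitions < target:
        npartitions *= 2          # round up to a power of two
    return max(HASHAGG_MIN_PARTITIONS,
               min(HASHAGG_MAX_PARTITIONS, npartitions))

# E.g. 20M groups at ~100 bytes each with 4MB work_mem (made-up numbers):
print(choose_num_partitions(20_000_000, 100, 4 * 1024 * 1024))
```

Note how the clamps shape behavior at the extremes: tiny estimates still get 4 partitions (cheap insurance against misestimation), and huge ones are capped at 1024, which forces recursive spilling instead of unbounded buffer memory.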
Re: Default setting for enable_hashagg_disk
On Fri, Jul 24, 2020 at 11:18:48AM -0400, Robert Haas wrote:
>On Thu, Jul 23, 2020 at 9:22 PM Tomas Vondra wrote:
>>                        2MB     4MB     8MB    64MB   256MB
>> ----------------------------------------------------------
>> hash                  6.71    6.70    6.73    6.44    5.81
>> hash CP_SMALL_TLIST   5.28    5.26    5.24    5.04    4.54
>> sort                  3.41    3.41    3.41    3.57    3.45
>>
>>So sort writes ~3.4GB of data, give or take. But hashagg/master writes
>>almost 6-7GB of data, i.e. almost twice as much. Meanwhile, with the
>>original CP_SMALL_TLIST we'd write "only" ~5GB of data. That's still
>>much more than the 3.4GB of data written by sort (which has to spill
>>everything, while hashagg only spills rows not covered by the groups
>>that fit into work_mem).
>>
>>I initially assumed this is due to writing the hash value to the tapes,
>>and the rows are fairly narrow (only about 40B per row), so a 4B hash
>>could make a difference - but certainly not this much. Moreover, that
>>does not explain the difference between master and the now-reverted
>>CP_SMALL_TLIST, I think.
>
>This is all really good analysis, I think, but this seems like the key
>finding. It seems like we don't really understand what's actually
>getting written. Whether we use hash or sort doesn't seem like it
>should have this kind of impact on how much data gets written, and
>whether we use CP_SMALL_TLIST or project when needed doesn't seem like
>it should matter like this either.

I think for CP_SMALL_TLIST at least some of the extra data can be attributed to writing the hash value along with the tuple, which sort obviously does not do. With the 32GB data set (the i5 machine), there are ~192M rows in the lineitem table, and with 4B hash values that's about 732MB of extra data. That's about 50% of the difference between sort and CP_SMALL_TLIST, and I'd dare to speculate the other 50% is due to LogicalTape internals (pointers to the next block, etc.)

The question is why master has 2x the overhead of CP_SMALL_TLIST, if it's meant to write the same set of columns etc.
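As a sanity check of the overhead estimate (pure arithmetic on the numbers quoted above -- 732MB of 4-byte hash values implies roughly 192M spilled tuples, consistent with lineitem's row count for this data set):

```python
MB = 1024 * 1024
hash_bytes = 4                     # 4B hash value written with each spilled tuple
rows = 732 * MB // hash_bytes      # tuples implied by 732MB of hash values
extra_mb = rows * hash_bytes / MB  # round-trip back to MB as a check
print(f"{rows:,} tuples -> ~{extra_mb:.0f}MB of extra spill data")
```

With ~40B per row, the 4B hash plus per-block LogicalTape bookkeeping adding roughly 10-15% of write volume is plausible, which is what makes the remaining 2x gap for master the real mystery.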
regards -- Tomas Vondra http://www.2ndQuadrant.com PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Re: Default setting for enable_hashagg_disk
On Thu, Jul 23, 2020 at 9:22 PM Tomas Vondra wrote:
>                        2MB     4MB     8MB    64MB   256MB
> ----------------------------------------------------------
> hash                  6.71    6.70    6.73    6.44    5.81
> hash CP_SMALL_TLIST   5.28    5.26    5.24    5.04    4.54
> sort                  3.41    3.41    3.41    3.57    3.45
>
> So sort writes ~3.4GB of data, give or take. But hashagg/master writes
> almost 6-7GB of data, i.e. almost twice as much. Meanwhile, with the
> original CP_SMALL_TLIST we'd write "only" ~5GB of data. That's still
> much more than the 3.4GB of data written by sort (which has to spill
> everything, while hashagg only spills rows not covered by the groups
> that fit into work_mem).
>
> I initially assumed this is due to writing the hash value to the tapes,
> and the rows are fairly narrow (only about 40B per row), so a 4B hash
> could make a difference - but certainly not this much. Moreover, that
> does not explain the difference between master and the now-reverted
> CP_SMALL_TLIST, I think.

This is all really good analysis, I think, but this seems like the key finding. It seems like we don't really understand what's actually getting written. Whether we use hash or sort doesn't seem like it should have this kind of impact on how much data gets written, and whether we use CP_SMALL_TLIST or project when needed doesn't seem like it should matter like this either.

-- Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Re: Default setting for enable_hashagg_disk
On Fri, Jul 24, 2020 at 10:40:47AM +0200, Tomas Vondra wrote: On Thu, Jul 23, 2020 at 07:33:45PM -0700, Peter Geoghegan wrote: On Thu, Jul 23, 2020 at 6:22 PM Tomas Vondra wrote: So let me share some fresh I/O statistics collected on the current code using iosnoop. I've done the tests on two different machines using the "aggregate part" of TPC-H Q17, i.e. essentially this: SELECT * FROM ( SELECT l_partkey AS agg_partkey, 0.2 * avg(l_quantity) AS avg_quantity FROM lineitem GROUP BY l_partkey OFFSET 10 ) part_agg; The OFFSET is there just to ensure we don't need to send anything to the client, etc. Thanks for testing this. So sort writes ~3.4GB of data, give or take. But hashagg/master writes almost 6-7GB of data, i.e. almost twice as much. Meanwhile, with the original CP_SMALL_TLIST we'd write "only" ~5GB of data. That's still much more than the 3.4GB of data written by sort (which has to spill everything, while hashagg only spills rows not covered by the groups that fit into work_mem). What I find when I run your query (with my own TPC-H DB that is smaller than what you used here -- 59,986,052 lineitem tuples) is that the sort required about 7x more memory than the hash agg to do everything in memory: 4,384,711KB for the quicksort vs 630,801KB peak hash agg memory usage. I'd be surprised if the ratio was very different for you -- but can you check? I can check, but it's not quite clear to me what are we looking for? Increase work_mem until there's no need to spill in either case? FWIW the hashagg needs about 4775953kB and the sort 33677586kB. So yeah, that's about 7x more. I think that's probably built into the TPC-H data set. It'd be easy to construct cases with much higher/lower factors. 
regards

-- Tomas Vondra
http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

                                  QUERY PLAN
 -----------------------------------------------------------------------------
 Limit  (cost=15493271.13..15493271.13 rows=1 width=36) (actual time=331351.099..331351.099 rows=0 loops=1)
   ->  HashAggregate  (cost=15186647.64..15493271.13 rows=20441566 width=36) (actual time=318190.465..330956.383 rows=1500 loops=1)
         Group Key: lineitem.l_partkey
         Peak Memory Usage: 4775953kB
         ->  Seq Scan on lineitem  (cost=0.00..12936556.76 rows=450018176 width=9) (actual time=0.156..56051.850 rows=450019701 loops=1)
 Planning Time: 0.151 ms
 Execution Time: 331931.239 ms
(7 rows)

                                  QUERY PLAN
 -----------------------------------------------------------------------------
 Limit  (cost=81298097.02..81298097.02 rows=1 width=36) (actual time=415344.639..415344.639 rows=0 loops=1)
   ->  GroupAggregate  (cost=77616337.21..81298097.02 rows=20441566 width=36) (actual time=209292.469..414951.954 rows=1500 loops=1)
         Group Key: lineitem.l_partkey
         ->  Sort  (cost=77616337.21..78741382.65 rows=450018176 width=9) (actual time=209292.435..329583.999 rows=450019701 loops=1)
               Sort Key: lineitem.l_partkey
               Sort Method: quicksort  Memory: 33677586kB
               ->  Seq Scan on lineitem  (cost=0.00..12936556.76 rows=450018176 width=9) (actual time=0.096..72474.733 rows=450019701 loops=1)
 Planning Time: 0.145 ms
 Execution Time: 417157.598 ms
(9 rows)
Re: Default setting for enable_hashagg_disk
On Thu, Jul 23, 2020 at 07:33:45PM -0700, Peter Geoghegan wrote: On Thu, Jul 23, 2020 at 6:22 PM Tomas Vondra wrote: So let me share some fresh I/O statistics collected on the current code using iosnoop. I've done the tests on two different machines using the "aggregate part" of TPC-H Q17, i.e. essentially this: SELECT * FROM ( SELECT l_partkey AS agg_partkey, 0.2 * avg(l_quantity) AS avg_quantity FROM lineitem GROUP BY l_partkey OFFSET 10 ) part_agg; The OFFSET is there just to ensure we don't need to send anything to the client, etc. Thanks for testing this. So sort writes ~3.4GB of data, give or take. But hashagg/master writes almost 6-7GB of data, i.e. almost twice as much. Meanwhile, with the original CP_SMALL_TLIST we'd write "only" ~5GB of data. That's still much more than the 3.4GB of data written by sort (which has to spill everything, while hashagg only spills rows not covered by the groups that fit into work_mem). What I find when I run your query (with my own TPC-H DB that is smaller than what you used here -- 59,986,052 lineitem tuples) is that the sort required about 7x more memory than the hash agg to do everything in memory: 4,384,711KB for the quicksort vs 630,801KB peak hash agg memory usage. I'd be surprised if the ratio was very different for you -- but can you check? I can check, but it's not quite clear to me what are we looking for? Increase work_mem until there's no need to spill in either case? I think that there is something pathological about this spill behavior, because it sounds like the precise opposite of what you might expect when you make a rough extrapolation of what disk I/O will be based on the memory used in no-spill cases (as reported by EXPLAIN ANALYZE). Maybe, not sure what exactly you think is pathological? The trouble is hashagg has to spill input tuples but the memory used in no-spill case represents aggregated groups, so I'm not sure how you could extrapolate from that ... 
FWIW one more suspicious thing that I forgot to mention is the behavior of the "planned partitions" depending on work_mem, which looks like this:

   2MB    Planned Partitions:  64    HashAgg Batches:  4160
   4MB    Planned Partitions: 128    HashAgg Batches: 16512
   8MB    Planned Partitions: 256    HashAgg Batches: 21488
  64MB    Planned Partitions:  32    HashAgg Batches:  2720
 256MB    Planned Partitions:   8    HashAgg Batches:     8

I'd expect the number of planned partitions to decrease (slowly) as work_mem increases, but it seems to increase initially. Seems a bit strange, but maybe it's expected.

>>What I find really surprising is the costing - despite writing about
>>twice as much data, the hashagg cost is estimated to be much lower than
>>the sort. For example on the i5 machine, the hashagg cost is ~10M, while
>>sort cost is almost 42M. Despite using almost twice as much disk. And
>>the costing is exactly the same for master and the CP_SMALL_TLIST.
>
>That does make it sound like the costs of the hash agg aren't being
>represented. I suppose it isn't clear if this is a costing issue because
>it isn't clear if the execution time performance itself is pathological
>or is instead something that must be accepted as the cost of spilling
>the hash agg in a general kind of way.

Not sure, but I think we need to spill roughly as much as sort, so it seems a bit strange that (a) we're spilling 2x as much data and yet (b) the cost is so much lower.

regards

-- Tomas Vondra
http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Re: Default setting for enable_hashagg_disk
On Thu, Jul 23, 2020 at 6:22 PM Tomas Vondra wrote: > So let me share some fresh I/O statistics collected on the current code > using iosnoop. I've done the tests on two different machines using the > "aggregate part" of TPC-H Q17, i.e. essentially this: > > SELECT * FROM ( > SELECT > l_partkey AS agg_partkey, > 0.2 * avg(l_quantity) AS avg_quantity > FROM lineitem GROUP BY l_partkey OFFSET 10 > ) part_agg; > > The OFFSET is there just to ensure we don't need to send anything to > the client, etc. Thanks for testing this. > So sort writes ~3.4GB of data, give or take. But hashagg/master writes > almost 6-7GB of data, i.e. almost twice as much. Meanwhile, with the > original CP_SMALL_TLIST we'd write "only" ~5GB of data. That's still > much more than the 3.4GB of data written by sort (which has to spill > everything, while hashagg only spills rows not covered by the groups > that fit into work_mem). What I find when I run your query (with my own TPC-H DB that is smaller than what you used here -- 59,986,052 lineitem tuples) is that the sort required about 7x more memory than the hash agg to do everything in memory: 4,384,711KB for the quicksort vs 630,801KB peak hash agg memory usage. I'd be surprised if the ratio was very different for you -- but can you check? I think that there is something pathological about this spill behavior, because it sounds like the precise opposite of what you might expect when you make a rough extrapolation of what disk I/O will be based on the memory used in no-spill cases (as reported by EXPLAIN ANALYZE). > What I find really surprising is the costing - despite writing about > twice as much data, the hashagg cost is estimated to be much lower than > the sort. For example on the i5 machine, the hashagg cost is ~10M, while > sort cost is almost 42M. Despite using almost twice as much disk. And > the costing is exactly the same for master and the CP_SMALL_TLIST. 
That does make it sound like the costs of the hash agg aren't being represented. I suppose it isn't clear if this is a costing issue because it isn't clear if the execution time performance itself is pathological or is instead something that must be accepted as the cost of spilling the hash agg in a general kind of way. -- Peter Geoghegan
Re: Default setting for enable_hashagg_disk
On Fri, Jul 17, 2020 at 5:13 PM Jeff Davis wrote:
> The patch itself looks reasonable to me. I don't see a lot of obvious
> dangers, but perhaps someone would like to take a closer look at the
> planner changes as you suggest.

Attached is v3 of the hash_mem_multiplier patch series, which now has a preparatory patch that removes hashagg_avoid_disk_plan. What do you think of this approach, Jeff?

It seems as if removing hashagg_avoid_disk_plan will necessitate removing various old bits of planner.c that were concerned with avoiding hash aggs that spill (the bits that hashagg_avoid_disk_plan skips in the common case where it's turned off). This makes v3-0001-* a bit trickier than I imagined it would have to be. At least it lowers the footprint of the hash_mem_multiplier code added by v3-0002-* (compared to the last version of the patch).

I find the partial group paths stuff added to planner.c by commit 4f15e5d09de rather confusing (that commit was preparatory work for the main feature commit e2f1eb0e). Hopefully the hashagg_avoid_disk_plan-removal patch didn't get anything wrong in this area. Perhaps Robert can comment on this as the committer of record for partition-wise grouping/aggregation.

I would like to commit this patch series by next week, and close out the two relevant open items. Separately, I suspect that we'll also need to update the cost model for hash aggs that spill, but that now seems like a totally unrelated matter. I'm waiting to hear back from Tomas about that. Tomas?

Thanks
-- Peter Geoghegan

v3-0001-Remove-hashagg_avoid_disk_plan-GUC.patch Description: Binary data
v3-0002-Add-hash_mem_multiplier-GUC.patch Description: Binary data
Re: Default setting for enable_hashagg_disk
On Tue, Jul 21, 2020 at 1:30 PM Bruce Momjian wrote:
> On Tue, Jul 14, 2020 at 03:49:40PM -0700, Peter Geoghegan wrote:
> > Maybe I missed your point here. The problem is not so much that we'll
> > get HashAggs that spill -- there is nothing intrinsically wrong with
> > that. While it's true that the I/O pattern is not as sequential as a
> > similar group agg + sort, that doesn't seem like the really important
> > factor here. The really important factor is that in-memory HashAggs
> > can be blazingly fast relative to *any* alternative strategy -- be it
> > a HashAgg that spills, or a group aggregate + sort that doesn't spill,
> > whatever. We're mostly concerned about keeping the one available fast
> > strategy than we are about getting a new, generally slow strategy.
>
> Do we have any data that in-memory HashAggs are "blazingly fast relative
> to *any* alternative strategy?" The data I have tested myself and what
> I saw from Tomas was that spilling sort or spilling hash are both 2.5x
> slower. Are we sure the quoted statement is true?

I admit that I was unclear in the remarks you quote here. I placed too much emphasis on the precise cross-over point at which a hash agg that didn't spill in Postgres 12 spills now. That was important to Andres, who was concerned about the added I/O, especially with things like cloud providers [1] -- it's not desirable to go from no I/O to lots of I/O when upgrading, regardless of how fast your disks for temp files are. But that was not the point I was trying to make (though it's a good point, and one that I agree with).

I'll take another shot at it. I'll use Andres' test case in [1]. Specifically this query (I went with this example because it was convenient):

SELECT a, array_agg(b)
FROM
  (SELECT generate_series(1, 1)) a(a),
  (SELECT generate_series(1, 1)) b(b)
GROUP BY a
HAVING array_length(array_agg(b), 1) = 0;

The planner generally prefers a hashagg here, though it's not a particularly sympathetic case for hash agg.
For one thing the input to the sort is already sorted. For another, there isn't skew. But the planner seems to have it right, at least when everything fits in memory, because that takes ~17.6 seconds with a group agg + sort vs ~13.2 seconds with an in-memory hash agg. Importantly, hash agg's peak memory usage is 1443558kB (once we get to the point that no spilling is required), whereas for the sort we're using 7833229kB for the quicksort. Don't forget that in-memory hash agg is using ~5.4x less memory in this case on account of the way hash agg represents things. It's faster, and much much more efficient once you take a holistic view (i.e. something like work done per second per KB of memory). Clearly the precise "query operation spills" cross-over point isn't that relevant to query execution time (on my server with a fast nvme SSD), because if I give the sort 95% - 99% of the memory it needs to be an in-memory quicksort then it makes a noticeable difference, but not a huge difference. I get one big run and one tiny run in the tuplesort. The query itself takes ~23.4 seconds -- higher than 17.6 seconds, but not all that much higher considering we have to write and read ~7GB of data. If I try to do approximately the same thing with hash agg (give it very slightly less than optimal memory) I find that the difference is smaller -- it takes ~14.1 seconds (up from ~13.2 seconds). It looks like my original remarks are totally wrong so far, because it's as if the performance hit is entirely explainable as the extra temp file I/O (right down to the fact that hash agg takes a smaller hit because it has much less to write out to disk). But let's keep going. = Sort vs Hash = We'll focus on how the group agg + sort case behaves as we take memory away. What I notice is that it literally doesn't matter how much memory I take away any more (now that the sort has started to spill). 
I said that it was ~23.4 seconds when we have two runs, but if I keep taking memory away so that we get 10 runs it takes 23.2 seconds. If there are 36 runs it takes 22.8 seconds. And if there are 144 runs (work_mem is 50MB, down from the "optimal" required for the sort to be internal, ~7GB) then it takes 21.9 seconds. So it gets slightly faster, not slower. We really don't need very much memory to do the sort in one pass, and it pretty much doesn't matter how many runs we need to merge provided it doesn't get into the thousands, which is quite rare (when random I/O from multiple passes finally starts to bite). Now for hash agg -- this is where it gets interesting. If we give it about half the memory it needs (work_mem 700MB) we still have 4 batches and it hardly changes -- it takes 19.8 seconds, which is slower than the 4 batch case that took 14.1 seconds but not that surprising. 300MB still gets 4 batches which now takes ~23.5 seconds. 200MB gets 2424 batches and takes ~27.7 seconds -- a big jump! With 100MB it takes ~31.1 seconds (3340 batches). 50MB it's ~32.8 seconds (3
Re: Default setting for enable_hashagg_disk
On Tue, Jul 14, 2020 at 6:49 PM Peter Geoghegan wrote: > Maybe I missed your point here. The problem is not so much that we'll > get HashAggs that spill -- there is nothing intrinsically wrong with > that. While it's true that the I/O pattern is not as sequential as a > similar group agg + sort, that doesn't seem like the really important > factor here. The really important factor is that in-memory HashAggs > can be blazingly fast relative to *any* alternative strategy -- be it > a HashAgg that spills, or a group aggregate + sort that doesn't spill, > whatever. We're mostly concerned about keeping the one available fast > strategy than we are about getting a new, generally slow strategy. I don't know; it depends. Like, if the less-sequential I/O pattern that is caused by a HashAgg is not really any slower than a Sort+GroupAgg, then whatever. The planner might as well try a HashAgg - because it will be fast if it stays in memory - and if it doesn't work out, we've lost little by trying. But if a Sort+GroupAgg is noticeably faster than a HashAgg that ends up spilling, then there is a potential regression. I thought we had evidence that this was a real problem, but if that's not the case, then I think we're fine as far as v13 goes. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Re: Default setting for enable_hashagg_disk
On Tue, Jul 14, 2020 at 03:49:40PM -0700, Peter Geoghegan wrote: > Maybe I missed your point here. The problem is not so much that we'll > get HashAggs that spill -- there is nothing intrinsically wrong with > that. While it's true that the I/O pattern is not as sequential as a > similar group agg + sort, that doesn't seem like the really important > factor here. The really important factor is that in-memory HashAggs > can be blazingly fast relative to *any* alternative strategy -- be it > a HashAgg that spills, or a group aggregate + sort that doesn't spill, > whatever. We're mostly concerned about keeping the one available fast > strategy than we are about getting a new, generally slow strategy. Do we have any data that in-memory HashAggs are "blazingly fast relative to *any* alternative strategy?" The data I have tested myself and what I saw from Tomas was that spilling sort or spilling hash are both 2.5x slower. Are we sure the quoted statement is true? -- Bruce Momjian https://momjian.us EnterpriseDB https://enterprisedb.com The usefulness of a cup is in its emptiness, Bruce Lee
Re: Default setting for enable_hashagg_disk
On Mon, Jul 20, 2020 at 09:17:21AM -0400, Tom Lane wrote: Tomas Vondra writes: There's a minor problem here, though - these stats were collected before we fixed the tlist issue, so hashagg was spilling about 10x the amount of data compared to sort+groupagg. So maybe that's the first thing we should do, before contemplating changes to the costing - collecting fresh data. I can do that, if needed. +1. I'm not sure if we still need to do anything, but we definitely can't tell on the basis of data that doesn't reliably reflect what the code does now. OK, will do. The hardware is busy doing something else at the moment, but I'll do the tests and report results in a couple days. regards -- Tomas Vondra http://www.2ndQuadrant.com PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Re: Default setting for enable_hashagg_disk
Tomas Vondra writes: > There's a minor problem here, though - these stats were collected before > we fixed the tlist issue, so hashagg was spilling about 10x the amount > of data compared to sort+groupagg. So maybe that's the first thing we > should do, before contemplating changes to the costing - collecting > fresh data. I can do that, if needed. +1. I'm not sure if we still need to do anything, but we definitely can't tell on the basis of data that doesn't reliably reflect what the code does now. regards, tom lane
Re: Default setting for enable_hashagg_disk
On Sun, Jul 19, 2020 at 02:17:15PM -0700, Jeff Davis wrote:
>On Sat, 2020-07-18 at 21:15 -0400, Tom Lane wrote:
>>Jeff Davis writes:
>>> What is your opinion about pessimizing the HashAgg disk costs (not
>>> affecting HashAgg plans expected to stay in memory)? Tomas Vondra
>>> presented some evidence that Sort had some better IO patterns in
>>> some cases that weren't easily reflected in a principled way in the
>>> cost model.
>>
>>Hm, was that in some other thread? I didn't find any such info in a
>>quick look through this one.
>
>https://www.postgresql.org/message-id/2df2e0728d48f498b9d6954b5f9080a34535c385.camel%40j-davis.com

FWIW the two messages to look at are these two:

1) report with initial data
https://www.postgresql.org/message-id/20200519151202.u2p2gpiawoaznsv2%40development

2) updated stats, with the block pre-allocation and tlist projection
https://www.postgresql.org/message-id/20200521001255.kfaihp3afv6vy6uq%40development

But I'm not convinced we actually need to tweak the costing - we've ended up fixing two things, and I think a lot of the differences in I/O patterns disappeared thanks to this.

For sort, the stats of request sizes look like this:

  type |  bytes  | count |  pct
  -----+---------+-------+-------
  RA   |  131072 | 26034 | 59.92
  RA   |   16384 |  6160 | 14.18
  RA   |    8192 |  3636 |  8.37
  RA   |   32768 |  3406 |  7.84
  RA   |   65536 |  3270 |  7.53
  RA   |   24576 |   361 |  0.83
  ...
  W    | 1310720 |  8070 | 34.26
  W    |  262144 |  1213 |  5.15
  W    |  524288 |  1056 |  4.48
  W    | 1056768 |   689 |  2.93
  W    |  786432 |   292 |  1.24
  W    |  802816 |   199 |  0.84
  ...

And for the hashagg, it looks like this:

  type |  bytes  | count  |  pct
  -----+---------+--------+-------
  RA   |  131072 | 200816 | 70.93
  RA   |    8192 |  23640 |  8.35
  RA   |   16384 |  19324 |  6.83
  RA   |   32768 |  19279 |  6.81
  RA   |   65536 |  19273 |  6.81
  ...
  W    | 1310720 |  18000 | 65.91
  W    |  524288 |   2074 |  7.59
  W    | 1048576 |    660 |  2.42
  W    |    8192 |    409 |  1.50
  W    |  786432 |    354 |  1.30
  ...

so it's actually a tad better than sort, because a larger proportion of both reads and writes is in larger chunks (reads 128kB, writes 1280kB). I think the device had default read-ahead setting, which I assume explains the 128kB.

For the statistics of deltas between requests - for sort:

  type | block_delta | count |  pct
  -----+-------------+-------+-------
  RA   |         256 | 13432 | 30.91
  RA   |          16 |  3291 |  7.57
  RA   |          32 |  3272 |  7.53
  RA   |          64 |  3266 |  7.52
  RA   |         128 |  2877 |  6.62
  RA   |        1808 |  1278 |  2.94
  RA   |       -2320 |   483 |  1.11
  RA   |       28928 |   386 |  0.89
  ...
  W    |        2560 |  7856 | 33.35
  W    |        2064 |  4921 | 20.89
  W    |        2080 |   586 |  2.49
  W    |       30960 |   300 |  1.27
  W    |        2160 |   253 |  1.07
  W    |        1024 |   248 |  1.05
  ...

and for hashagg:

  type | block_delta | count  |  pct
  -----+-------------+--------+-------
  RA   |         256 | 180955 | 63.91
  RA   |          32 |  19274 |  6.81
  RA   |          64 |  19273 |  6.81
  RA   |         128 |  19264 |  6.80
  RA   |          16 |  19203 |  6.78
  RA   |       30480 |   9835 |  3.47

At first this might look worse than sort, but 256 sectors matches the 128kB from the request size stats, and it's a good match (64% vs. 70%).

There's a minor problem here, though - these stats were collected before we fixed the tlist issue, so hashagg was spilling about 10x the amount of data compared to sort+groupagg. So maybe that's the first thing we should do, before contemplating changes to the costing - collecting fresh data. I can do that, if needed.

regards

-- Tomas Vondra
http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Re: Default setting for enable_hashagg_disk
On Sat, Jul 18, 2020 at 3:04 PM Jeff Davis wrote: > > I think the entire discussion > > is way out ahead of any field evidence that we need such a knob. > > In the absence of evidence, our default position ought to be to > > keep it simple, not to accumulate backwards-compatibility kluges. > > Fair enough. I think that was where Stephen and Amit were coming from, > as well. > That would lessen the number of changed plans, but we could easily > remove the pessimization without controversy later if it turned out to > be unnecessary, or if we further optimize HashAgg IO. Does this mean that we've reached a final conclusion on hashagg_avoid_disk_plan for Postgres 13, which is that it should be removed? If so, I'd appreciate it if you took care of it. I don't think that we need to delay its removal until the details of the HashAgg cost pessimization are finalized. (I expect that that will be totally uncontroversial.) Thanks -- Peter Geoghegan
Re: Default setting for enable_hashagg_disk
On Sat, 2020-07-18 at 21:15 -0400, Tom Lane wrote: > Jeff Davis writes: > > What is your opinion about pessimizing the HashAgg disk costs (not > > affecting HashAgg plans expected to stay in memory)? Tomas Vondra > > presented some evidence that Sort had some better IO patterns in > > some > > cases that weren't easily reflected in a principled way in the cost > > model. > > Hm, was that in some other thread? I didn't find any such info > in a quick look through this one. https://www.postgresql.org/message-id/2df2e0728d48f498b9d6954b5f9080a34535c385.camel%40j-davis.com Regards, Jeff Davis
Re: Default setting for enable_hashagg_disk
On Sun, Jul 19, 2020 at 4:38 AM Stephen Frost wrote:
> > (The only reason I'm in favor of hash_mem[_multiplier] is that it
> > seems like it might be possible to use it to get *better* plans
> > than before. I do not see it as a backwards-compatibility knob.)
>
> I still don't think a hash_mem-type thing is really the right direction
> to go in, even if making a distinction between memory used for sorting
> and memory used for hashing is, and I'm of the general opinion that we'd
> be thinking about doing something better and more appropriate - except
> for the fact that we're talking about adding this in during beta.
>
> In other words, if we'd stop trying to shoehorn something in, which
> we're doing because we're in beta, we'd very likely be talking about all
> of this in a very different way and probably be contemplating something
> like a query_mem that provides for an overall memory limit and which
> favors memory for hashing over memory for sorting, etc.

At minimum we'd need a patch we would be happy with dropping in should there be user complaints. And once this conversation ends with that in hand, I have my doubts whether there will be interest, or even project desirability, in working toward a "better" solution should this one prove itself "good enough". The fact that this patch seems unlikely to foreclose on other promising solutions, combined with the non-trivial behavioral change we've made, suggests to me that we might as well just deploy whatever short-term solution we come up with now.

As for hashagg_avoid_disk_plan...

The physical processes we are modelling here:

1. Processing D amount of records takes M amount of memory
2. Processing D amount of records in-memory takes T time per record, while doing the same on-disk takes V time per record
3. Processing D amount of records via some other plan has an effective cost U
4. V >> T (V is strictly greater than T)
5. Having chosen a value for M that ensures T, it is still possible for V to end up being used

Thus:

If we get D wrong, the user can still tweak the system by changing hash_mem_multiplier (this is strictly better than v12, which used work_mem).

Setting hashagg_avoid_disk_plan = off provides a means to move V infinitely far away from T (set to on by default; off reverts to v12 behavior).

There is no way for the user to move V's relative position toward T (n/a in v12).

The only way to move T is to make it infinitely large by setting enable_hashagg = off (same as in v12).

Is hashagg_disk_cost_multiplier = [0.0, 1,000,000,000.0], i.e., (T * hashagg_disk_cost_multiplier == V), doable? It has a nice symmetry with hash_mem_multiplier and can move V both toward and away from T. To the extent T is tunable or not in v12, it can remain the same in v13.

David J.
Re: Default setting for enable_hashagg_disk
Stephen Frost writes: > In other words, if we'd stop trying to shoehorn something in, which > we're doing because we're in beta, we'd very likely be talking about all > of this in a very different way and probably be contemplating something > like a query_mem that provides for an overall memory limit and which > favors memory for hashing over memory for sorting, etc. Even if we were at the start of the dev cycle rather than its end, I'm not sure I agree. Yes, replacing work_mem with some more-holistic approach would be great. But that's a research project, one that we can't be sure will yield fruit on any particular schedule. (Seeing that we've understood this to be a problem for *decades*, I would tend to bet on a longer not shorter time frame for a solution.) I think that if we are worried about hashagg-spill behavior in the near term, we have to have some fix that's not conditional on solving that very large problem. The only other practical alternative is "do nothing for v13", and I agree with the camp that doesn't like that. regards, tom lane
Re: Default setting for enable_hashagg_disk
Greetings,

* Tom Lane (t...@sss.pgh.pa.us) wrote:
> Jeff Davis writes:
> > On Fri, 2020-07-17 at 18:38 -0700, Peter Geoghegan wrote:
> >> There is also the separate question of what to do about the
> >> hashagg_avoid_disk_plan GUC (this is a separate open item that
> >> requires a separate resolution). Tom leans slightly towards removing
> >> it now. Is your position about the same as before?
>
> > Yes, I think we should have that GUC (hashagg_avoid_disk_plan) for at
> > least one release.
>
> You're being optimistic about it being possible to remove a GUC once
> we ship it. That seems to be a hard sell most of the time.

Agreed.

> I'm honestly a bit baffled about the level of fear being expressed
> around this feature. We have *frequently* made changes that would
> change query plans, perhaps not 100.00% for the better, and never
> before have we had this kind of bikeshedding about whether it was
> necessary to be able to turn it off. I think the entire discussion
> is way out ahead of any field evidence that we need such a knob.
> In the absence of evidence, our default position ought to be to
> keep it simple, not to accumulate backwards-compatibility kluges.

+100

> (The only reason I'm in favor of hash_mem[_multiplier] is that it
> seems like it might be possible to use it to get *better* plans
> than before. I do not see it as a backwards-compatibility knob.)

I still don't think a hash_mem-type thing is really the right direction to go in, even if making a distinction between memory used for sorting and memory used for hashing is, and I'm of the general opinion that we'd be thinking about doing something better and more appropriate - except for the fact that we're talking about adding this in during beta. 
In other words, if we'd stop trying to shoehorn something in, which we're doing because we're in beta, we'd very likely be talking about all of this in a very different way and probably be contemplating something like a query_mem that provides for an overall memory limit and which favors memory for hashing over memory for sorting, etc.

Thanks,

Stephen
Re: Default setting for enable_hashagg_disk
Jeff Davis writes: > What is your opinion about pessimizing the HashAgg disk costs (not > affecting HashAgg plans expected to stay in memory)? Tomas Vondra > presented some evidence that Sort had some better IO patterns in some > cases that weren't easily reflected in a principled way in the cost > model. Hm, was that in some other thread? I didn't find any such info in a quick look through this one. > That would lessen the number of changed plans, but we could easily > remove the pessimization without controversy later if it turned out to > be unnecessary, or if we further optimize HashAgg IO. Trying to improve our cost models under-the-hood seems like a perfectly reasonable activity to me. regards, tom lane
Re: Default setting for enable_hashagg_disk
On Sat, 2020-07-18 at 14:30 -0400, Tom Lane wrote:
> You're being optimistic about it being possible to remove a GUC once
> we ship it. That seems to be a hard sell most of the time.

If nothing else, a repeat of this thread in a year or two to discuss removing a GUC doesn't seem appealing.

> I think the entire discussion
> is way out ahead of any field evidence that we need such a knob.
> In the absence of evidence, our default position ought to be to
> keep it simple, not to accumulate backwards-compatibility kluges.

Fair enough. I think that was where Stephen and Amit were coming from, as well.

What is your opinion about pessimizing the HashAgg disk costs (not affecting HashAgg plans expected to stay in memory)? Tomas Vondra presented some evidence that Sort had some better IO patterns in some cases that weren't easily reflected in a principled way in the cost model.

That would lessen the number of changed plans, but we could easily remove the pessimization without controversy later if it turned out to be unnecessary, or if we further optimize HashAgg IO.

Regards,
Jeff Davis
Re: Default setting for enable_hashagg_disk
On Sat, Jul 18, 2020 at 11:30 AM Tom Lane wrote:
> Jeff Davis writes:
> > Yes, I think we should have that GUC (hashagg_avoid_disk_plan) for at
> > least one release.
>
> You're being optimistic about it being possible to remove a GUC once
> we ship it. That seems to be a hard sell most of the time.

You've said that you're +0.5 on removing this GUC, while Jeff seems to be about -0.5 (at least that's my take). It's hard to see a way towards closing out the hashagg_avoid_disk_plan open item if that's our starting point.

The "do we need to keep hashagg_avoid_disk_plan?" question is fundamentally a value judgement IMV. I believe that you both understand each other's perspectives. I also suspect that no pragmatic compromise will be possible -- we can either have the hashagg_avoid_disk_plan GUC or not have it. ISTM that we're deadlocked, at least in a technical or procedural sense.

Does that understanding seem accurate to you both?

-- Peter Geoghegan
Re: Default setting for enable_hashagg_disk
Jeff Davis writes:
> On Fri, 2020-07-17 at 18:38 -0700, Peter Geoghegan wrote:
>> There is also the separate question of what to do about the
>> hashagg_avoid_disk_plan GUC (this is a separate open item that
>> requires a separate resolution). Tom leans slightly towards removing
>> it now. Is your position about the same as before?

> Yes, I think we should have that GUC (hashagg_avoid_disk_plan) for at
> least one release.

You're being optimistic about it being possible to remove a GUC once we ship it. That seems to be a hard sell most of the time.

I'm honestly a bit baffled about the level of fear being expressed around this feature. We have *frequently* made changes that would change query plans, perhaps not 100.00% for the better, and never before have we had this kind of bikeshedding about whether it was necessary to be able to turn it off. I think the entire discussion is way out ahead of any field evidence that we need such a knob. In the absence of evidence, our default position ought to be to keep it simple, not to accumulate backwards-compatibility kluges.

(The only reason I'm in favor of hash_mem[_multiplier] is that it seems like it might be possible to use it to get *better* plans than before. I do not see it as a backwards-compatibility knob.)

regards, tom lane
Re: Default setting for enable_hashagg_disk
On Fri, 2020-07-17 at 18:38 -0700, Peter Geoghegan wrote:
> There is also the separate question of what to do about the
> hashagg_avoid_disk_plan GUC (this is a separate open item that
> requires a separate resolution). Tom leans slightly towards removing
> it now. Is your position about the same as before?

Yes, I think we should have that GUC (hashagg_avoid_disk_plan) for at least one release.

Clearly, a lot of plans will change. For any GROUP BY where there are a lot of groups, there was only one choice in v12 and now there are two choices in v13. Obviously I think most of those changes will be for the better, but some regressions are bound to happen. Giving users some time to adjust, and for us to tune the cost model based on user feedback, seems prudent.

Are there other examples of widespread changes in plans where we *didn't* have a GUC? There are many GUCs for controlling parallelism, JIT, etc.

Regards,
Jeff Davis
Re: Default setting for enable_hashagg_disk
On Fri, Jul 17, 2020 at 5:13 PM Jeff Davis wrote: > The idea is growing on me a bit. It doesn't give exactly v12 behavior, > but it does offer another lever that might tackle a lot of the > practical cases. Cool. > If I were making the decision alone, I'd still choose > the escape hatch based on simplicity, but I'm fine with this approach > as well. There is also the separate question of what to do about the hashagg_avoid_disk_plan GUC (this is a separate open item that requires a separate resolution). Tom leans slightly towards removing it now. Is your position about the same as before? > The patch itself looks reasonable to me. I don't see a lot of obvious > dangers, but perhaps someone would like to take a closer look at the > planner changes as you suggest. It would be good to get further input on the patch from somebody else, particularly the planner aspects. My intention is to commit the patch myself. I was the primary advocate for hash_mem_multiplier, so it seems as if I should own it. (You may have noticed that I just pushed the preparatory local-variable-renaming patch, to get that piece out of the way.) -- Peter Geoghegan
Re: Default setting for enable_hashagg_disk
On Tue, 2020-07-14 at 21:12 -0700, Peter Geoghegan wrote: > Attached is a WIP patch implementing hash_mem_multiplier, with 1.0 as > the GUC's default value (i.e. the patch introduces no behavioral > changes by default). The first patch in the series renames some local > variables whose name is made ambiguous by the second, main patch. The idea is growing on me a bit. It doesn't give exactly v12 behavior, but it does offer another lever that might tackle a lot of the practical cases. If I were making the decision alone, I'd still choose the escape hatch based on simplicity, but I'm fine with this approach as well. The patch itself looks reasonable to me. I don't see a lot of obvious dangers, but perhaps someone would like to take a closer look at the planner changes as you suggest. Regards, Jeff Davis
Re: Default setting for enable_hashagg_disk
On Mon, Jul 13, 2020 at 9:47 AM Alvaro Herrera wrote:
> I'm in favor of hash_mem_multiplier. I think a >1 default is more
> sensible than =1 in the long run, but if strategic vote is what we're
> doing, then I support the =1 option.

Attached is a WIP patch implementing hash_mem_multiplier, with 1.0 as the GUC's default value (i.e. the patch introduces no behavioral changes by default). The first patch in the series renames some local variables whose name is made ambiguous by the second, main patch.

Since the patch doesn't add a new work_mem-style GUC, but existing consumers of work_mem expect something like that, the code is structured in a way that allows the planner and executor to pretend that there really is a work_mem-style GUC called hash_mem, which they can determine the value of by calling the get_hash_mem() function. This seemed like the simplest approach overall. I placed the get_hash_mem() function in nodeHash.c, which is a pretty random place for it. If anybody has any better ideas about where it should live, please say so.

ISTM that the planner changes are where there's most likely to be problems. Reviewers should examine consider_groupingsets_paths() in detail.

-- Peter Geoghegan

v2-0001-Rename-hash_mem-local-variable.patch Description: Binary data
v2-0002-Add-hash_mem_multiplier-GUC.patch Description: Binary data
Re: Default setting for enable_hashagg_disk
On Tue, Jul 14, 2020 at 12:46 PM Robert Haas wrote: > - I thought the problem we were trying to solve here was that, in v12, > if the planner thinks that your hashagg will fit in memory when really > it doesn't, you will get good performance because we'll cheat; in v13, > you'll get VERY bad performance because we won't. That is the problem we started out with. I propose to solve a broader problem that I believe mostly encompasses the original problem (it's an "inventor's paradox" situation). Although the exact degree to which it truly addresses the original problem will vary across installations, I believe that it will go a very long way towards cutting down on problems for users upgrading to Postgres 13 generally. > - So, if hash_mem_multiplier affects both planning and execution, it > doesn't really solve the problem. Neither does adjusting the existing > work_mem setting. Imagine that you have two queries. The planner > thinks Q1 will use 1GB of memory for a HashAgg but it will actually > need 2GB. It thinks Q2 will use 1.5GB for a HashAgg but it will > actually need 3GB. If you plan using a 1GB memory limit, Q1 will pick > a HashAgg and perform terribly when it spills. Q2 will pick a > GroupAggregate which will be OK but not great. If you plan with a 2GB > memory limit, Q1 will pick a HashAgg and will not spill so now it will > be in great shape. But Q2 will pick a HashAgg and then spill so it > will stink. Oops. Maybe I missed your point here. The problem is not so much that we'll get HashAggs that spill -- there is nothing intrinsically wrong with that. While it's true that the I/O pattern is not as sequential as a similar group agg + sort, that doesn't seem like the really important factor here. The really important factor is that in-memory HashAggs can be blazingly fast relative to *any* alternative strategy -- be it a HashAgg that spills, or a group aggregate + sort that doesn't spill, whatever. 
We're more concerned about keeping the one available fast strategy than we are about getting a new, generally slow strategy. There will be no problems at all unless and until we're short on memory, because you can just increase work_mem and everything works out, regardless of the details. Obviously the general problems we anticipate only crop up when increasing work_mem stops being a viable DBA strategy.

By teaching the system to have at least a crude appreciation of the value of memory when hashing vs when sorting, the system is often able to give much more memory to Hash aggs (and hash joins). Increasing hash_mem_multiplier (maybe while also decreasing work_mem) will be beneficial when we take memory from things that don't really need so much, like sorts (or even CTE tuplestores) -- we reduce the memory pressure without paying a commensurate price in system throughput (maybe even only a very small hit). As a bonus, everything going faster may actually *reduce* the memory usage for the system as a whole, even as individual queries use more memory.

Under this scheme, it may well not matter that you cannot cheat (Postgres 12 style) anymore, because you'll be able to use the memory that is available sensibly -- regardless of whether or not the group estimates are very good (we have to have more than zero faith in the estimates -- they can be bad without being terrible). Maybe no amount of tuning can ever restore the desirable Postgres 12 performance characteristics you came to rely on, but remaining "regressions" are probably cases where the user was flying pretty close to the sun OOM-wise all along. They may have been happy with Postgres 12, but at a certain point that really is something that you have to view as a fool's paradise, even if like me you happen to be a memory Keynesian. Really big outliers tend to be rare and therefore something that the user can afford to have go slower. 
It's the constant steady stream of medium-sized hash aggs that we mostly need to worry about. To the extent that that's true, hash_mem_multiplier addresses the problem on the table.

> - An escape hatch that prevents spilling at execution time *does*
> solve this problem, but now suppose we add a Q3 which the planner
> thinks will use 512MB of memory but at execution time it will actually
> consume 512GB due to the row count estimate being 1024x off. So if you
> enable the escape hatch to get back to a situation where Q1 and Q2
> both perform acceptably, then Q3 makes your system OOM.

Right. Nothing stops these two things from being true at the same time.

> - If you were to instead introduce a GUC like what I proposed before,
> which allows the execution-time memory usage to exceed what was
> planned, but only by a certain margin, then you can set
> hash_mem_execution_overrun_multiplier_thingy=2.5 and call it a day.
> Now, no matter how you set work_mem, you're fine. Depending on the
> value you choose for work_mem, you may get group aggregates for some
> of the queries. If you set it large enough that you get hash
> aggregates, then Q1 and Q2 will avoid spilling (which works but is
> slow) because the overrun is less than 2x. Q3 will spill, so you
> won't OOM.
Re: Default setting for enable_hashagg_disk
On Mon, Jul 13, 2020 at 2:50 PM Peter Geoghegan wrote:
> Primarily in favor of escape hatch:
>
> Jeff,
> DavidR,
> Pavel,
> Andres,
> Robert ??,
> Amit ??
>
> Primarily in favor of hash_mem/hash_mem_multiplier:
>
> PeterG,
> Tom,
> Alvaro,
> Tomas,
> Justin,
> DavidG,
> Jonathan Katz
>
> There are clear problems with this summary, including for example the
> fact that Robert weighed in before the hash_mem/hash_mem_multiplier
> proposal was even on the table. What he actually said about it [1]
> seems closer to hash_mem, so I feel that putting him in that bucket is
> a conservative assumption on my part. Same goes for Amit, who warmed
> to the idea of hash_mem_multiplier recently. (Though I probably got
> some detail wrong, in which case please correct me.)

My view is:

- I thought the problem we were trying to solve here was that, in v12, if the planner thinks that your hashagg will fit in memory when really it doesn't, you will get good performance because we'll cheat; in v13, you'll get VERY bad performance because we won't.

- So, if hash_mem_multiplier affects both planning and execution, it doesn't really solve the problem. Neither does adjusting the existing work_mem setting. Imagine that you have two queries. The planner thinks Q1 will use 1GB of memory for a HashAgg but it will actually need 2GB. It thinks Q2 will use 1.5GB for a HashAgg but it will actually need 3GB. If you plan using a 1GB memory limit, Q1 will pick a HashAgg and perform terribly when it spills. Q2 will pick a GroupAggregate which will be OK but not great. If you plan with a 2GB memory limit, Q1 will pick a HashAgg and will not spill so now it will be in great shape. But Q2 will pick a HashAgg and then spill so it will stink. Oops.

- An escape hatch that prevents spilling at execution time *does* solve this problem, but now suppose we add a Q3 which the planner thinks will use 512MB of memory but at execution time it will actually consume 512GB due to the row count estimate being 1024x off. So if you enable the escape hatch to get back to a situation where Q1 and Q2 both perform acceptably, then Q3 makes your system OOM.

- If you were to instead introduce a GUC like what I proposed before, which allows the execution-time memory usage to exceed what was planned, but only by a certain margin, then you can set hash_mem_execution_overrun_multiplier_thingy=2.5 and call it a day. Now, no matter how you set work_mem, you're fine. Depending on the value you choose for work_mem, you may get group aggregates for some of the queries. If you set it large enough that you get hash aggregates, then Q1 and Q2 will avoid spilling (which works but is slow) because the overrun is less than 2x. Q3 will spill, so you won't OOM. Wahoo!

- I do agree in general that it makes more sense to allow hash_work_mem > sort_work_mem, and even to make that the default. Allowing the same budget for both is unreasonable, because I think we have good evidence that inadequate memory has a severe impact on hashing operations but usually only a fairly mild effect on sorting operations, except in the case where the underrun is severe. That is, if you need 1GB of memory for a sort and you only get 768MB, the slowdown is much much less severe than if the same thing happens for a hash. If you have 10MB of memory, both are going to suck, but that's kinda unavoidable.

- If you hold my feet to the fire and ask me to choose between a Boolean escape hatch (rather than a multiplier-based one) and hash_mem_multiplier, gosh, I don't know. I guess the Boolean escape hatch? I mean it's a pretty bad solution, but at least if I have that I can get both Q1 and Q2 to perform well at the same time, and I guess I'm no worse off than I was in v12. The hash_mem_multiplier thing, assuming it affects both planning and execution, seems like a very good idea in general, but I guess I don't see how it helps with this problem.

-- Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Re: Default setting for enable_hashagg_disk
On Mon, Jul 13, 2020 at 9:50 PM Peter Geoghegan wrote: > > On Mon, Jul 13, 2020 at 6:13 AM Stephen Frost wrote: > > Yes, increasing work_mem isn't unusual, at all. > > It's unusual as a way of avoiding OOMs! > > > Eh? That's not at all what it looks like- they were getting OOM's > > because they set work_mem to be higher than the actual amount of memory > > they had and the Sort before the GroupAgg was actually trying to use all > > that memory. The HashAgg ended up not needing that much memory because > > the aggregated set wasn't actually that large. If anything, this shows > > exactly what Jeff's fine work here is (hopefully) going to give us- the > > option to plan a HashAgg in such cases, since we can accept spilling to > > disk if we end up underestimate, or take advantage of that HashAgg > > being entirely in memory if we overestimate. > > I very specifically said that it wasn't a case where something like > hash_mem would be expected to make all the difference. > > > Having looked back, I'm not sure that I'm really in the minority > > regarding the proposal to add this at this time either- there's been a > > few different comments that it's too late for v13 and/or that we should > > see if we actually end up with users seriously complaining about the > > lack of a separate way to specify the memory for a given node type, > > and/or that if we're going to do this then we should have a broader set > > of options covering other nodes types too, all of which are positions > > that I agree with. > > By proposing to do nothing at all, you are very clearly in a small > minority. While (for example) I might have debated the details with > David Rowley a lot recently, and you couldn't exactly say that we're > in agreement, our two positions are nevertheless relatively close > together. > > AFAICT, the only other person that has argued that we should do > nothing (have no new GUC) is Bruce, which was a while ago now. 
(Amit > said something similar, but has since softened his opinion [1]). > To be clear, my vote for PG13 is not to do anything till we have clear evidence of regressions. In the email you quoted, I was trying to say that due to parallelism we might not have the problem for which we are planning to provide an escape-hatch or hash_mem GUC. I think the reason for the delay in getting to the agreement is that there is no clear evidence for the problem (user-reported cases or results of some benchmarks like TPC-H) unless I have missed something. Having said that, I understand that we have to reach some conclusion to close this open item and if the majority of people are in-favor of escape-hatch or hash_mem solution then we have to do one of those. -- With Regards, Amit Kapila. EnterpriseDB: http://www.enterprisedb.com
Re: Default setting for enable_hashagg_disk
"David G. Johnston" writes: > To be clear, by "escape hatch" you mean "add a GUC that instructs the > PostgreSQL executor to ignore hash_mem when deciding whether to spill the > contents of the hash table to disk - IOW to never spill the contents of a > hash table to disk"? If so that seems separate from whether to add a > hash_mem GUC to provide finer grained control - people may well want both. If we define the problem narrowly as "allow people to get back exactly the pre-v13 behavior", then yeah you'd need an escape hatch of that sort. We should not, however, forget that the pre-v13 behavior is pretty darn problematic. It's hard to see why anyone would really want to get back exactly "never spill even if it leads to OOM". The proposals for allowing a higher-than-work_mem, but not infinite, spill boundary seem to me to be a reasonable way to accommodate cases where the old behavior is accidentally preferable to what v13 does right now. Moreover, such a knob seems potentially useful in its own right, at least as a stopgap until we figure out how to generalize or remove work_mem. (Which might be a long time.) I'm not unalterably opposed to providing an escape hatch of the other sort, but right now I think the evidence for needing it isn't there. If we get field complaints that can't be resolved with the "raise the spill threshold by X" approach, we could reconsider. But that approach seems a whole lot less brittle than "raise the spill threshold to infinity", so I think we should start with the former type of fix. regards, tom lane
Re: Default setting for enable_hashagg_disk
On Mon, Jul 13, 2020 at 12:57 PM David G. Johnston wrote: > To be clear, by "escape hatch" you mean "add a GUC that instructs the > PostgreSQL executor to ignore hash_mem when deciding whether to spill the > contents of the hash table to disk - IOW to never spill the contents of a > hash table to disk"? Yes, that's what that means. > If so that seems separate from whether to add a hash_mem GUC to provide finer > grained control - people may well want both. They might want the escape hatch too, as an additional measure, but my assumption is that anybody in favor of the hash_mem/hash_mem_multiplier proposal takes that position because they think that it's the principled solution. That's the kind of subtlety that is bound to get lost when summarizing general sentiment at a high level. In any case no individual has seriously argued that there is a simultaneous need for both -- at least not yet. This thread is already enormous, and very hard to keep up with. I'm trying to draw a line under the discussion. For my part, I have compromised on the important question of the default value of hash_mem_multiplier -- I am writing a new version of the patch that makes the default 1.0 (i.e. no behavioral changes by default). > I would prefer DavidJ as an abbreviation - my middle initial can be dropped > when referring to me. Sorry about that. -- Peter Geoghegan
Re: Default setting for enable_hashagg_disk
On Mon, Jul 13, 2020 at 11:50 AM Peter Geoghegan wrote: > > Primarily in favor of escape hatch: > > Jeff, > DavidR, > Pavel, > Andres, > Robert ??, > Amit ?? > > To be clear, by "escape hatch" you mean "add a GUC that instructs the PostgreSQL executor to ignore hash_mem when deciding whether to spill the contents of the hash table to disk - IOW to never spill the contents of a hash table to disk"? If so that seems separate from whether to add a hash_mem GUC to provide finer grained control - people may well want both. Primarily in favor of hash_mem/hash_mem_multiplier: > > PeterG, > Tom, > Alvaro, > Tomas, > Justin, > DavidG, > Jonathan Katz > > I would prefer DavidJ as an abbreviation - my middle initial can be dropped when referring to me. David J.
Re: Default setting for enable_hashagg_disk
On Mon, Jul 13, 2020 at 7:25 AM David Rowley wrote: > I think it would be good if we could try to move towards getting > consensus here rather than reiterating our arguments over and over. +1 > Updated summary: > * For hash_mem = Tomas [7], Justin [16] > * For hash_mem_multiplier with a default > 1.0 = DavidG [21] > * For hash_mem_multiplier with default = 1.0 = PeterG [15][0], Tom [20][24] > * hash_mem out of scope for PG13 = Bruce [8], Andres [9] > * hashagg_mem default to -1 meaning use work_mem = DavidR [23] (2nd > preference) > * Escape hatch that can be removed later when we get something better > = Jeff [11], DavidR [12], Pavel [13], Andres [14], Justin [1] > * Add enable_hashagg_spill = Tom [2] (I'm unclear on this proposal. > Does it affect the planner or executor or both?) (updated opinion in > [20]) > * Maybe do nothing until we see how things go during beta = Bruce [3], Amit > [10] > * Just let users set work_mem = Stephen [21], Alvaro [4] (Alvaro > changed his mind after Andres pointed out that changes other nodes in > the plan too [25]) > * Swap enable_hashagg for a GUC that specifies when spilling should > occur. -1 means work_mem = Robert [17], Amit [18] > * hash_mem does not solve the problem = Tomas [6] (changed his mind in [7]) I don't think that hashagg_mem needs to be considered here, because you were the only one that spoke out in favor of that idea, and it's your second preference in any case (maybe Tom was in favor of such a thing at one point, but he clearly favors hash_mem/hash_mem_multiplier now so it hardly matters). I don't think that hashagg_mem represents a meaningful compromise between the escape hatch and hash_mem/hash_mem_multiplier in any case. (I would *prefer* the escape hatch to hashagg_mem, since at least the escape hatch is an "honest" escape hatch.) ISTM that there are three basic approaches to resolving this open item that remain: 1. Do nothing. 2. Add an escape hatch. 3. Add hash_mem/hash_mem_multiplier. 
Many people (e.g., Tom, Jeff, you, Andres, myself) have clearly indicated that doing nothing is simply a non-starter. It's not just that it doesn't get a lot of votes -- it's something that is strongly opposed. We can rule it out right away. This is where it gets harder. Many of us have views that won't easily fit into buckets. For example, even though I myself proposed hash_mem/hash_mem_multiplier, I've said that I can live with the escape hatch. Similarly, Jeff favors the escape hatch, but has said that he can live with hash_mem/hash_mem_multiplier. And, Andres said to me privately that he thinks that hash_mem could be a good idea, even though he opposes it now due to release management considerations. Even still, I think that it's possible to divide people into two camps on this without grossly misrepresenting anybody. Primarily in favor of escape hatch: Jeff, DavidR, Pavel, Andres, Robert ??, Amit ?? Primarily in favor of hash_mem/hash_mem_multiplier: PeterG, Tom, Alvaro, Tomas, Justin, DavidG, Jonathan Katz There are clear problems with this summary, including for example the fact that Robert weighed in before the hash_mem/hash_mem_multiplier proposal was even on the table. What he actually said about it [1] seems closer to hash_mem, so I feel that putting him in that bucket is a conservative assumption on my part. Same goes for Amit, who warmed to the idea of hash_mem_multiplier recently. (Though I probably got some detail wrong, in which case please correct me.) ISTM that there is a majority of opinion in favor of hash_mem/hash_mem_multiplier. If you assume that I have this wrong, and that we're simply deadlocked, then it becomes a matter for the RMT. I strongly doubt that that changes the overall outcome, since this year's RMT members happen to all be in favor of the hash_mem/hash_mem_multiplier proposal on an individual basis. [1] https://www.postgresql.org/message-id/ca+tgmobyv9+t-wjx-ctpdqurcgt1thz1ml3v1nxc4m4g-h6...@mail.gmail.com -- Peter Geoghegan
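To make the two remaining options concrete, here is a rough sketch of how each would look to a user, assuming the hash_mem_multiplier semantics discussed in this thread; the escape-hatch GUC name below is purely hypothetical, since no such GUC was ever committed:

```sql
-- Option 3: hash_mem_multiplier. Hash-based nodes may use
-- work_mem * hash_mem_multiplier before spilling to disk.
SET work_mem = '64MB';
SET hash_mem_multiplier = 2.0;  -- hash-based nodes may use up to 128MB

-- Option 2: an escape hatch (hypothetical name). Hash aggregation
-- would never spill, restoring the pre-v13 overflow behavior.
-- SET hashagg_never_spill = on;
```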
Re: Default setting for enable_hashagg_disk
On Mon, Jul 13, 2020 at 12:47:36PM -0400, Alvaro Herrera wrote: > On 2020-Jul-13, Jeff Davis wrote: > > > On Tue, 2020-07-14 at 02:25 +1200, David Rowley wrote: > > > Updated summary: > > > * For hash_mem = Tomas [7], Justin [16] > > > * For hash_mem_multiplier with a default > 1.0 = DavidG [21] > > > * For hash_mem_multiplier with default = 1.0 = PeterG [15][0], Tom > > > [20][24] > > > > I am OK with these options, but I still prefer a simple escape hatch. > > I'm in favor of hash_mem_multiplier. I think a >1 default is more > sensible than =1 in the long run, but if strategic vote is what we're > doing, then I support the =1 option. I recanted and support hash_mem_multiplier (or something supporting that behavior, even if it also supports an absolute/scalar value). https://www.postgresql.org/message-id/20200703145620.gk4...@telsasoft.com 1.0 (or -1) is fine, possibly to be >= 1.0 in master at a later date. -- Justin
Re: Default setting for enable_hashagg_disk
Alvaro Herrera writes: > I'm in favor of hash_mem_multiplier. I think a >1 default is more > sensible than =1 in the long run, but if strategic vote is what we're > doing, then I support the =1 option. FWIW, I also think that we'll eventually end up with >1 default. But the evidence to support that isn't really there yet, so I'm good with 1.0 default to start with. regards, tom lane
Re: Default setting for enable_hashagg_disk
On 2020-Jul-13, Jeff Davis wrote: > On Tue, 2020-07-14 at 02:25 +1200, David Rowley wrote: > > Updated summary: > > * For hash_mem = Tomas [7], Justin [16] > > * For hash_mem_multiplier with a default > 1.0 = DavidG [21] > > * For hash_mem_multiplier with default = 1.0 = PeterG [15][0], Tom > > [20][24] > > I am OK with these options, but I still prefer a simple escape hatch. I'm in favor of hash_mem_multiplier. I think a >1 default is more sensible than =1 in the long run, but if strategic vote is what we're doing, then I support the =1 option. -- Álvaro Herrera https://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
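For illustration, the practical difference between the two defaults under discussion might look like this (a sketch based on the proposed semantics, where the hash limit is derived from work_mem):

```sql
-- Default of 1.0: hash-based nodes get exactly work_mem, so existing
-- configurations behave as they did before the change.
SET work_mem = '32MB';
SET hash_mem_multiplier = 1.0;  -- hash limit: 32MB (status quo)

-- A >1 default would raise the hash limit for everyone on upgrade:
SET hash_mem_multiplier = 2.0;  -- hash limit: 64MB, without touching work_mem
```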
Re: Default setting for enable_hashagg_disk
On Mon, Jul 13, 2020 at 6:13 AM Stephen Frost wrote: > Yes, increasing work_mem isn't unusual, at all. It's unusual as a way of avoiding OOMs! > Eh? That's not at all what it looks like- they were getting OOM's > because they set work_mem to be higher than the actual amount of memory > they had and the Sort before the GroupAgg was actually trying to use all > that memory. The HashAgg ended up not needing that much memory because > the aggregated set wasn't actually that large. If anything, this shows > exactly what Jeff's fine work here is (hopefully) going to give us- the > option to plan a HashAgg in such cases, since we can accept spilling to > disk if we end up underestimating, or take advantage of that HashAgg > being entirely in memory if we overestimate. I very specifically said that it wasn't a case where something like hash_mem would be expected to make all the difference. > Having looked back, I'm not sure that I'm really in the minority > regarding the proposal to add this at this time either- there's been a > few different comments that it's too late for v13 and/or that we should > see if we actually end up with users seriously complaining about the > lack of a separate way to specify the memory for a given node type, > and/or that if we're going to do this then we should have a broader set > of options covering other node types too, all of which are positions > that I agree with. By proposing to do nothing at all, you are very clearly in a small minority. While (for example) I might have debated the details with David Rowley a lot recently, and you couldn't exactly say that we're in agreement, our two positions are nevertheless relatively close together. AFAICT, the only other person that has argued that we should do nothing (have no new GUC) is Bruce, which was a while ago now. (Amit said something similar, but has since softened his opinion [1]). 
[1] https://postgr.es.m/m/CAA4eK1+KMSQuOq5Gsj-g-pYec_8zgGb4K=xrznbcccnaumf...@mail.gmail.com -- Peter Geoghegan
Re: Default setting for enable_hashagg_disk
On 2020-07-13 16:11, Tomas Vondra wrote: Why is running out of disk space a worse experience than running out of memory? Sure, it'll take longer and ultimately the query fails (and if it fills the device used by the WAL then it may also cause shutdown of the main instance due to inability to write WAL). But that can be prevented by moving the temp tablespace and/or setting the temp file limit, as already mentioned. With OOM, if the kernel OOM killer decides to act, it may easily bring down the instance too, and there are far fewer options to prevent that. Well, that's an interesting point. Depending on the OS setup, by default an out of memory might actually be worse if the OOM killer strikes in an unfortunate way. That didn't happen to me in my tests, so the OS must have been configured differently by default. So maybe a lesson here is that just like we have been teaching users to adjust the OOM killer, we have to teach them now that setting the temp file limit might become more important. -- Peter Eisentraut http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Re: Default setting for enable_hashagg_disk
On Tue, 2020-07-14 at 02:25 +1200, David Rowley wrote: > Updated summary: > * For hash_mem = Tomas [7], Justin [16] > * For hash_mem_multiplier with a default > 1.0 = DavidG [21] > * For hash_mem_multiplier with default = 1.0 = PeterG [15][0], Tom > [20][24] I am OK with these options, but I still prefer a simple escape hatch. > * Maybe do nothing until we see how things go during beta = Bruce > [3], Amit [10] > * Just let users set work_mem = Stephen [21], Alvaro [4] (Alvaro > changed his mind after Andres pointed out that changes other nodes in > the plan too [25]) I am not on board with these options. Regards, Jeff Davis
Re: Default setting for enable_hashagg_disk
On Tue, 14 Jul 2020 at 01:13, Stephen Frost wrote: > Yes, increasing work_mem isn't unusual, at all. What that tweet shows, > which I don't think folks who are suggesting things like setting this > factor to 2.0 appreciate, is that people may have a work_mem configured in the > gigabytes- meaning that a 2.0 value would result in a work_mem of 5GB > and a hash_mem of 10GB. Now, I'm all for telling people to review their > configurations between major versions, but that's a large difference > that's going to be pretty deeply hidden in a 'multiplier' setting. I think Peter seems to be fine with setting the default to 1.0, per [0]. This thread did split off a while back into "Default setting for enable_hashagg_disk (hash_mem)", I did try and summarise who sits where on this in [19]. I think it would be good if we could try to move towards getting consensus here rather than reiterating our arguments over and over. Updated summary: * For hash_mem = Tomas [7], Justin [16] * For hash_mem_multiplier with a default > 1.0 = DavidG [21] * For hash_mem_multiplier with default = 1.0 = PeterG [15][0], Tom [20][24] * hash_mem out of scope for PG13 = Bruce [8], Andres [9] * hashagg_mem default to -1 meaning use work_mem = DavidR [23] (2nd preference) * Escape hatch that can be removed later when we get something better = Jeff [11], DavidR [12], Pavel [13], Andres [14], Justin [1] * Add enable_hashagg_spill = Tom [2] (I'm unclear on this proposal. Does it affect the planner or executor or both?) (updated opinion in [20]) * Maybe do nothing until we see how things go during beta = Bruce [3], Amit [10] * Just let users set work_mem = Stephen [21], Alvaro [4] (Alvaro changed his mind after Andres pointed out that changes other nodes in the plan too [25]) * Swap enable_hashagg for a GUC that specifies when spilling should occur. 
-1 means work_mem = Robert [17], Amit [18] * hash_mem does not solve the problem = Tomas [6] (changed his mind in [7]) Perhaps people who have managed to follow this thread but not chip in yet can reply quoting the option above that they'd be voting for. Or if you're ok changing your mind to some option that has more votes than the one your name is already against. That might help move this along. David [0] https://www.postgresql.org/message-id/CAH2-Wz=VV6EKFGUJDsHEqyvRk7pCO36BvEoF5sBQry_O6R2=n...@mail.gmail.com [1] https://www.postgresql.org/message-id/20200624031443.gv4...@telsasoft.com [2] https://www.postgresql.org/message-id/2214502.1593019...@sss.pgh.pa.us [3] https://www.postgresql.org/message-id/20200625182512.gc12...@momjian.us [4] https://www.postgresql.org/message-id/20200625224422.GA9653@alvherre.pgsql [5] https://www.postgresql.org/message-id/caa4ek1k0cgk_8hryxsvppgoh_z-ny+uztcfwb2we6baj9dx...@mail.gmail.com [6] https://www.postgresql.org/message-id/20200627104141.gq7d3hm2tvoqgjjs@development [7] https://www.postgresql.org/message-id/20200629212229.n3afgzq6xpxrr4cu@development [8] https://www.postgresql.org/message-id/20200703030001.gd26...@momjian.us [9] https://www.postgresql.org/message-id/20200707171216.jqxrld2jnxwf5...@alap3.anarazel.de [10] https://www.postgresql.org/message-id/CAA4eK1KfPi6iz0hWxBLZzfVOG_NvOVJL=9UQQirWLpaN=ka...@mail.gmail.com [11] https://www.postgresql.org/message-id/8bff2e4e8020c3caa16b61a46918d21b573eaf78.ca...@j-davis.com [12] https://www.postgresql.org/message-id/CAApHDvqFZikXhAGW=ukzkq1_fzhy+xzmuzajinj6rwythh4...@mail.gmail.com [13] https://www.postgresql.org/message-id/cafj8prbf1w4ndz-ynd+muptfizfbs7+cpjc4ob8v9d3x0ms...@mail.gmail.com [14] https://www.postgresql.org/message-id/20200624191433.5gnqgrxfmucex...@alap3.anarazel.de [15] https://www.postgresql.org/message-id/CAH2-WzmD+i1pG6rc1+Cjc4V6EaFJ_qSuKCCHVnH=oruqd-z...@mail.gmail.com [16] https://www.postgresql.org/message-id/20200703024649.gj4...@telsasoft.com [17] 
https://www.postgresql.org/message-id/ca+tgmobyv9+t-wjx-ctpdqurcgt1thz1ml3v1nxc4m4g-h6...@mail.gmail.com [18] https://www.postgresql.org/message-id/caa4ek1k0cgk_8hryxsvppgoh_z-ny+uztcfwb2we6baj9dx...@mail.gmail.com [19] https://www.postgresql.org/message-id/caaphdvrp1fiev4aql2zscbhi32w+gp01j+qnhwou7y7p-qf...@mail.gmail.com [20] https://www.postgresql.org/message-id/2107841.1594403...@sss.pgh.pa.us [21] https://www.postgresql.org/message-id/20200710141714.gi12...@tamriel.snowman.net [22] https://www.postgresql.org/message-id/CAKFQuwa2gwLa0b%2BmQv5r5A_Q0XWsA2%3D1zQ%2BZ5m4pQprxh-aM4Q%40mail.gmail.com [23] https://www.postgresql.org/message-id/caaphdvpxbhhp566rrjjwgnfs0yoxr53eztz5lhh-jcekvqd...@mail.gmail.com [24] https://www.postgresql.org/message-id/2463591.1594514...@sss.pgh.pa.us [25] https://www.postgresql.org/message-id/20200625225853.GA11137%40alvherre.pgsql
Re: Default setting for enable_hashagg_disk
On Mon, Jul 13, 2020 at 01:51:42PM +0200, Peter Eisentraut wrote: On 2020-04-07 20:20, Jeff Davis wrote: Now that we have Disk-based Hash Aggregation, there are a lot more situations where the planner can choose HashAgg. The enable_hashagg_disk GUC, if set to true, chooses HashAgg based on costing. If false, it only generates a HashAgg path if it thinks it will fit in work_mem, similar to the old behavior (though it will now spill to disk if the planner was wrong about it fitting in work_mem). The current default is true. I have an anecdote that might be related to this discussion. I was running an unrelated benchmark suite. With PostgreSQL 12, one query ran out of memory. With PostgreSQL 13, the same query instead ran out of disk space. I bisected this to the introduction of disk-based hash aggregation. Of course, the very point of that feature is to eliminate the out of memory and make use of disk space instead. But running out of disk space is likely to be a worse experience than running out of memory. Also, while it's relatively easy to limit memory use both in PostgreSQL and in the kernel, it is difficult or impossible to limit disk space use in a similar way. Why is running out of disk space a worse experience than running out of memory? Sure, it'll take longer and ultimately the query fails (and if it fills the device used by the WAL then it may also cause shutdown of the main instance due to inability to write WAL). But that can be prevented by moving the temp tablespace and/or setting the temp file limit, as already mentioned. With OOM, if the kernel OOM killer decides to act, it may easily bring down the instance too, and there are far fewer options to prevent that. regards -- Tomas Vondra http://www.2ndQuadrant.com PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Re: Default setting for enable_hashagg_disk
Greetings, * Peter Geoghegan (p...@bowt.ie) wrote: > On Sat, Jul 11, 2020 at 7:22 AM Stephen Frost wrote: > > * Tom Lane (t...@sss.pgh.pa.us) wrote: > > > Have you got a better proposal that is reasonably implementable for v13? > > > (I do not accept the argument that "do nothing" is a better proposal.) > > > So, no, I don't agree that 'do nothing' (except ripping out the one GUC > > that was already added) is a worse proposal than adding another work_mem > > like thing that's only for some node types. > > The question was "Have you got a better proposal that is reasonably > implementable for v13?". > > This is anecdotal, but just today somebody on Twitter reported > *increasing* work_mem to stop getting OOMs from group aggregate + > sort: > > https://twitter.com/theDressler/status/1281942941133615104 Yes, increasing work_mem isn't unusual, at all. What that tweet shows, which I don't think folks who are suggesting things like setting this factor to 2.0 appreciate, is that people may have a work_mem configured in the gigabytes- meaning that a 2.0 value would result in a work_mem of 5GB and a hash_mem of 10GB. Now, I'm all for telling people to review their configurations between major versions, but that's a large difference that's going to be pretty deeply hidden in a 'multiplier' setting. I'm still wholly unconvinced that we need such a setting, just to be clear, but I don't think there's any way it'd be reasonable to have it set to something other than "whatever work_mem is" by default- and it needs to actually be "what work_mem is" and not "have the same default value" or *everyone* would have to configure it. > It was possible to fix the problem in this instance, since evidently > there wasn't anything else that really did try to consume ~5 GB of > work_mem memory. Evidently the memory isn't available in any general > sense, so there are no OOMs now. 
Nevertheless, we can expect OOMs on > this server just as soon as there is a real need to do a ~5GB sort, > regardless of anything else. Eh? That's not at all what it looks like- they were getting OOM's because they set work_mem to be higher than the actual amount of memory they had and the Sort before the GroupAgg was actually trying to use all that memory. The HashAgg ended up not needing that much memory because the aggregated set wasn't actually that large. If anything, this shows exactly what Jeff's fine work here is (hopefully) going to give us- the option to plan a HashAgg in such cases, since we can accept spilling to disk if we end up underestimating, or take advantage of that HashAgg being entirely in memory if we overestimate. > I don't think that this kind of perverse effect is uncommon. Hash > aggregate can naturally be far faster than group agg + sort, Hash agg > can naturally use a lot less memory in many cases, and we have every > reason to think that grouping estimates are regularly totally wrong. I'm confused as to why we're talking about the relative performance of a HashAgg vs. a Sort+GroupAgg- of course the HashAgg is going to be faster if it's got enough memory, but that's a constraint we have to consider and deal with because, otherwise, the query can end up failing and potentially impacting other queries or activity on the system, including resulting in the entire database system falling over due to the OOM Killer firing and killing a process and the database ending up restarting and going through crash recovery, which is going to be quite a bit worse than performance maybe not being great. > You're significantly underestimating the risk. Of... what? That we'll end up getting worse performance because we underestimated the size of the result set and we end up spilling to disk with the HashAgg? I think I'm giving that risk the amount of concern it deserves- which is, frankly, not very much. 
Users who run into that issue, as this tweet *also* showed, are familiar with work_mem and can tune it to address that. This demand for a new GUC to break up work_mem into pieces strikes me as unjustified, and doing so during beta makes it that much worse. Having looked back, I'm not sure that I'm really in the minority regarding the proposal to add this at this time either- there's been a few different comments that it's too late for v13 and/or that we should see if we actually end up with users seriously complaining about the lack of a separate way to specify the memory for a given node type, and/or that if we're going to do this then we should have a broader set of options covering other node types too, all of which are positions that I agree with. Thanks, Stephen signature.asc Description: PGP signature
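Stephen's magnitude concern can be made concrete with the numbers from the tweet (a sketch; the 2.0 default was only a proposal at this point, not committed behavior):

```sql
SET work_mem = '5GB';
-- With hash_mem_multiplier = 2.0, every hash-based node would be
-- entitled to roughly 10GB before spilling: a large change that is
-- easy to miss inside a 'multiplier' setting after an upgrade.
-- With hash_mem_multiplier = 1.0, the limit stays at 5GB (status quo).
```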
Re: Default setting for enable_hashagg_disk
On 2020-07-13 14:16, David Rowley wrote: Isn't that what temp_file_limit is for? Yeah, I guess that is so rarely used that I had forgotten about it. So maybe that is also something that more users will want to be aware of. -- Peter Eisentraut http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Re: Default setting for enable_hashagg_disk
On Mon, 13 Jul 2020 at 23:51, Peter Eisentraut wrote: > I have an anecdote that might be related to this discussion. > > I was running an unrelated benchmark suite. With PostgreSQL 12, one > query ran out of memory. With PostgreSQL 13, the same query instead ran > out of disk space. I bisected this to the introduction of disk-based > hash aggregation. Of course, the very point of that feature is to > eliminate the out of memory and make use of disk space instead. But > running out of disk space is likely to be a worse experience than > running out of memory. Also, while it's relatively easy to limit memory > use both in PostgreSQL and in the kernel, it is difficult or impossible > to limit disk space use in a similar way. Isn't that what temp_file_limit is for? David
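For reference, a minimal sketch of the settings mentioned here; `spill_ts` is a hypothetical tablespace name that would have to exist already:

```sql
-- Cap how much temp-file space a single process may use, so a spilling
-- hash aggregate cannot fill the disk:
SET temp_file_limit = '50GB';
-- Optionally keep spill files on a separate volume from the WAL:
SET temp_tablespaces = 'spill_ts';
```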
Re: Default setting for enable_hashagg_disk
On 2020-04-07 20:20, Jeff Davis wrote: Now that we have Disk-based Hash Aggregation, there are a lot more situations where the planner can choose HashAgg. The enable_hashagg_disk GUC, if set to true, chooses HashAgg based on costing. If false, it only generates a HashAgg path if it thinks it will fit in work_mem, similar to the old behavior (though it will now spill to disk if the planner was wrong about it fitting in work_mem). The current default is true. I have an anecdote that might be related to this discussion. I was running an unrelated benchmark suite. With PostgreSQL 12, one query ran out of memory. With PostgreSQL 13, the same query instead ran out of disk space. I bisected this to the introduction of disk-based hash aggregation. Of course, the very point of that feature is to eliminate the out of memory and make use of disk space instead. But running out of disk space is likely to be a worse experience than running out of memory. Also, while it's relatively easy to limit memory use both in PostgreSQL and in the kernel, it is difficult or impossible to limit disk space use in a similar way. I don't have a solution or proposal here, I just want to mention this as a possibility and suggest that we look out for similar experiences. -- Peter Eisentraut http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
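As a sketch of the behavior Jeff describes in the quoted text (`big_table` is a hypothetical relation):

```sql
-- Default: HashAgg is chosen purely on cost, and may plan to spill.
SET enable_hashagg_disk = on;

-- Old-style: only generate a HashAgg path if it is expected to fit in
-- work_mem; the executor will still spill if that estimate was wrong.
SET enable_hashagg_disk = off;
EXPLAIN SELECT x, count(*) FROM big_table GROUP BY x;
```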
Re: Default setting for enable_hashagg_disk
On Sat, Jul 11, 2020 at 3:30 AM Tom Lane wrote: > > Stephen Frost writes: > > I don't see hash_mem as being any kind of proper fix- it's just punting > > to the user saying "we can't figure this out, how about you do it" and, > > worse, it's in conflict with how we already ask the user that question. > > Turning it into a multiplier doesn't change that either. > > Have you got a better proposal that is reasonably implementable for v13? > (I do not accept the argument that "do nothing" is a better proposal.) > > I agree that hash_mem is a stopgap, whether it's a multiplier or no, > but at this point it seems difficult to avoid inventing a stopgap. > Getting rid of the process-global work_mem setting is a research project, > and one I wouldn't even count on having results from for v14. In the > meantime, it seems dead certain that there are applications for which > the current behavior will be problematic. > If this is true then certainly it adds more weight to the argument for having a solution like hash_mem or some other escape-hatch. I know it would be difficult to get the real-world data, but why not try TPC-H or similar workloads at a few different scale factors/sizes? I was checking some old TPC-H results I had, and I found that many of the plans were using Finalize GroupAggregate and Partial GroupAggregate kinds of plans; there were a few where I saw Partial HashAggregate being used, but on a random check GroupAggregate seems to be used more. It could be that after parallelism GroupAggregate plans are getting preference, but I am not sure about this. However, even if that is not true, I think after parallel aggregation the memory-related concern is taken care of to some extent automatically, because each worker doing partial aggregation can be allowed to consume work_mem memory. 
So, probably the larger aggregates which are going to give better performance by consuming more memory would already be parallelized and would have given the desired results. Now, allowing aggregates to use more memory via a hash_mem kind of thing is beneficial in non-parallel cases, but for cases where parallelism is used it could be worse, because now each worker will be entitled to use more memory. -- With Regards, Amit Kapila. EnterpriseDB: http://www.enterprisedb.com
Re: Default setting for enable_hashagg_disk
On Sat, Jul 11, 2020 at 10:26:22PM -0700, David G. Johnston wrote: On Saturday, July 11, 2020, Tom Lane wrote: "David G. Johnston" writes: > On Sat, Jul 11, 2020 at 5:47 PM Tom Lane wrote: >> It seems like a lot of the disagreement here is focused on Peter's >> proposal to make hash_mem_multiplier default to 2.0. But it doesn't >> seem to me that that's a critical element of the proposal. Why not just >> make it default to 1.0, thus keeping the default behavior identical >> to what it is now? > If we don't default it to something other than 1.0 we might as well just > make it memory units and let people decide precisely what they want to use > instead of adding the complexity of a multiplier. Not sure how that follows? The advantage of a multiplier is that it tracks whatever people might do to work_mem automatically. I was thinking that setting -1 would basically do that. I think Tom meant that the multiplier would automatically track any changes to work_mem, and adjust the hash_mem accordingly. With -1 (and the GUC in units) you could only keep it exactly equal to work_mem, but then as soon as you change it you'd have to update both. In general I'd view work_mem as the base value that people twiddle to control executor memory consumption. Having to also twiddle this other value doesn't seem especially user-friendly. I’ll admit I don’t have a feel for what is or is not user-friendly when setting these GUCs in a session to override the global defaults. But as far as the global defaults I say it’s a wash between (32mb, -1) -> (32mb, 48mb) and (32mb, 1.0) -> (32mb, 1.5) If you want 96mb for the session/query hash setting it to 96mb is invariant, while setting it to 3.0 means it can change in the future if the system work_mem changes. Knowing the multiplier is 1.5 and choosing 64mb for work_mem in the session is possible but also mutable and has side-effects. If the user is going to set both values to make it invariant we are back to it being a wash. 
I don’t believe using a multiplier will promote better comprehension for why this setting exists compared to “-1 means use work_mem but you can override a subset if you want.” Is having a session level memory setting be mutable something we want to introduce? Is it more user-friendly? I still think it should be in simple units, TBH. We already have a somewhat similar situation with cost parameters, where we often say that seq_page_cost = 1.0 is the baseline for the other cost parameters, yet we have not coded that as multipliers. If we find that's a poor default, we can always change it later; >> but it seems to me that the evidence for a higher default is >> a bit thin at this point. > So "your default is 1.0 unless you installed the new database on or after > 13.4 in which case it's 2.0"? What else would be new? See e.g. 848ae330a. (Note I'm not suggesting that we'd change it in a minor release.) Minor release update is what I had thought, and to an extent was made possible by not using the multiplier upfront. I agree options are wide open come v14 and beyond. David J. regards -- Tomas Vondra http://www.2ndQuadrant.com PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Re: Default setting for enable_hashagg_disk
On Sat, Jul 11, 2020 at 08:47:54PM -0400, Tom Lane wrote: Tomas Vondra writes: I don't know, but one of the main arguments against simply suggesting people to bump up work_mem (if they're hit by the hashagg spill in v13) was that it'd increase overall memory usage for them. It seems strange to then propose a new GUC set to a default that would result in higher memory usage *for everyone*. It seems like a lot of the disagreement here is focused on Peter's proposal to make hash_mem_multiplier default to 2.0. But it doesn't seem to me that that's a critical element of the proposal. Why not just make it default to 1.0, thus keeping the default behavior identical to what it is now? If we find that's a poor default, we can always change it later; but it seems to me that the evidence for a higher default is a bit thin at this point. You're right, I was specifically pushing against that aspect of the proposal. Sorry for not making that clearer, I assumed it's clear from the context of this (sub)thread. I agree that making it 1.0 (or equal to work_mem, if it's not a multiplier) by default, while allowing it to be increased if needed, would address most of the spilling issues. regards -- Tomas Vondra http://www.2ndQuadrant.com PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Re: Default setting for enable_hashagg_disk
On Saturday, July 11, 2020, Tom Lane wrote: > "David G. Johnston" writes: > > On Sat, Jul 11, 2020 at 5:47 PM Tom Lane wrote: > >> It seems like a lot of the disagreement here is focused on Peter's > >> proposal to make hash_mem_multiplier default to 2.0. But it doesn't > >> seem to me that that's a critical element of the proposal. Why not just > >> make it default to 1.0, thus keeping the default behavior identical > >> to what it is now? > > > If we don't default it to something other than 1.0 we might as well just > > make it memory units and let people decide precisely what they want to > use > > instead of adding the complexity of a multiplier. > > Not sure how that follows? The advantage of a multiplier is that it > tracks whatever people might do to work_mem automatically. > I was thinking that setting -1 would basically do that. > In general > I'd view work_mem as the base value that people twiddle to control > executor memory consumption. Having to also twiddle this other value > doesn't seem especially user-friendly. I’ll admit I don’t have a feel for what is or is not user-friendly when setting these GUCs in a session to override the global defaults. But as far as the global defaults I say it’s a wash between (32mb, -1) -> (32mb, 48mb) and (32mb, 1.0) -> (32mb, 1.5) If you want 96mb for the session/query hash setting it to 96mb is invariant, while setting it to 3.0 means it can change in the future if the system work_mem changes. Knowing the multiplier is 1.5 and choosing 64mb for work_mem in the session is possible but also mutable and has side-effects. If the user is going to set both values to make it invariant we are back to it being a wash. I don’t believe using a multiplier will promote better comprehension for why this setting exists compared to “-1 means use work_mem but you can override a subset if you want.” Is having a session level memory setting be mutable something we want to introduce? Is it more user-friendly? 
>> If we find that's a poor default, we can always change it later; > >> but it seems to me that the evidence for a higher default is > >> a bit thin at this point. > > > So "your default is 1.0 unless you installed the new database on or after > > 13.4 in which case it's 2.0"? > > What else would be new? See e.g. 848ae330a. (Note I'm not suggesting > that we'd change it in a minor release.) > Minor release update is what I had thought, and to an extent was making possible by not using the multiplier upfront. I agree options are wide open come v14 and beyond. David J.
Re: Default setting for enable_hashagg_disk
"David G. Johnston" writes: > On Sat, Jul 11, 2020 at 5:47 PM Tom Lane wrote: >> It seems like a lot of the disagreement here is focused on Peter's >> proposal to make hash_mem_multiplier default to 2.0. But it doesn't >> seem to me that that's a critical element of the proposal. Why not just >> make it default to 1.0, thus keeping the default behavior identical >> to what it is now? > If we don't default it to something other than 1.0 we might as well just > make it memory units and let people decide precisely what they want to use > instead of adding the complexity of a multiplier. Not sure how that follows? The advantage of a multiplier is that it tracks whatever people might do to work_mem automatically. In general I'd view work_mem as the base value that people twiddle to control executor memory consumption. Having to also twiddle this other value doesn't seem especially user-friendly. >> If we find that's a poor default, we can always change it later; >> but it seems to me that the evidence for a higher default is >> a bit thin at this point. > So "your default is 1.0 unless you installed the new database on or after > 13.4 in which case it's 2.0"? What else would be new? See e.g. 848ae330a. (Note I'm not suggesting that we'd change it in a minor release.) regards, tom lane
Re: Default setting for enable_hashagg_disk
On Sat, Jul 11, 2020 at 5:47 PM Tom Lane wrote: > Tomas Vondra writes: > > I don't know, but one of the main arguments against simply suggesting > > people to bump up work_mem (if they're hit by the hashagg spill in v13) > > was that it'd increase overall memory usage for them. It seems strange > > to then propose a new GUC set to a default that would result in higher > > memory usage *for everyone*. > > It seems like a lot of the disagreement here is focused on Peter's > proposal to make hash_mem_multiplier default to 2.0. But it doesn't > seem to me that that's a critical element of the proposal. Why not just > make it default to 1.0, thus keeping the default behavior identical > to what it is now? > If we don't default it to something other than 1.0 we might as well just make it memory units and let people decide precisely what they want to use instead of adding the complexity of a multiplier. > If we find that's a poor default, we can always change it later; > but it seems to me that the evidence for a higher default is > a bit thin at this point. > So "your default is 1.0 unless you installed the new database on or after 13.4 in which case it's 2.0"? I'd rather have it be just memory units defaulting to -1 meaning "use work_mem". In the unlikely scenario we decide post-release to want a multiplier > 1.0 we can add the GUC with that default at that point. The multiplier would want to be ignored if hash_mem is set to anything other than -1. David J.
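[Editor's note: the two competing schemes in this subthread can be sketched in a few lines. This is an illustrative Python sketch only, not PostgreSQL source; the function names are invented, and the numbers match the 32MB/96MB examples discussed above.]

```python
# Illustrative sketch (not PostgreSQL source) of how the two proposed GUC
# schemes resolve the memory limit used by hash-based executor nodes.

def hash_limit_absolute(work_mem_kb: int, hash_mem_kb: int) -> int:
    """hash_mem as plain memory units; -1 means 'just use work_mem'."""
    return work_mem_kb if hash_mem_kb == -1 else hash_mem_kb

def hash_limit_multiplier(work_mem_kb: int, multiplier: float) -> int:
    """hash_mem_multiplier scales work_mem, so it tracks later work_mem changes."""
    return int(work_mem_kb * multiplier)

# With the absolute form, a session value of 98304 kB (96MB) stays fixed even
# if work_mem later changes; the multiplier form moves with work_mem instead.
assert hash_limit_absolute(32768, -1) == 32768       # default: same as work_mem
assert hash_limit_absolute(65536, 98304) == 98304    # invariant under work_mem changes
assert hash_limit_multiplier(32768, 3.0) == 98304    # 32MB * 3.0 = 96MB
assert hash_limit_multiplier(65536, 3.0) == 196608   # tracks the new work_mem
```

This is the "wash" David refers to: either form can express 96MB, but only the multiplier form silently changes when the global work_mem does.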
Re: Default setting for enable_hashagg_disk
On Sun, Jul 12, 2020 at 2:27 AM Tom Lane wrote: > David Rowley writes: > > hmm yeah. It's unfortunate, but I'm not sure how I'd have implemented > > it differently. The problem is made worse by the fact that we'll only > > release the memory for the hash table during ExecEndHashJoin(). If the > > planner had some ability to provide the executor with knowledge that > > the node would never be rescanned, then the executor could release the > > memory for the hash table after the join is complete. > > EXEC_FLAG_REWIND seems to fit the bill already? FWIW I have a patch that does exactly that, which I was planning to submit for CF2 along with some other patches that estimate and measure peak executor memory usage.
Re: Default setting for enable_hashagg_disk
I would be okay with a default of 1.0. Peter Geoghegan (Sent from my phone)
Re: Default setting for enable_hashagg_disk
Tomas Vondra writes: > I don't know, but one of the main arguments against simply suggesting > people to bump up work_mem (if they're hit by the hashagg spill in v13) > was that it'd increase overall memory usage for them. It seems strange > to then propose a new GUC set to a default that would result in higher > memory usage *for everyone*. It seems like a lot of the disagreement here is focused on Peter's proposal to make hash_mem_multiplier default to 2.0. But it doesn't seem to me that that's a critical element of the proposal. Why not just make it default to 1.0, thus keeping the default behavior identical to what it is now? If we find that's a poor default, we can always change it later; but it seems to me that the evidence for a higher default is a bit thin at this point. regards, tom lane
Re: Default setting for enable_hashagg_disk
On Sat, Jul 11, 2020 at 09:02:43AM -0700, David G. Johnston wrote: On Sat, Jul 11, 2020 at 7:22 AM Stephen Frost wrote: There now seem to be some suggestions that not only should we have a new GUC, but we should default to having it not be equal to work_mem (or 1.0 or whatever) and instead be higher, to be *twice* or larger whatever the existing work_mem setting is- meaning that people whose systems are working just fine and have good estimates that represent their workload and who get the plans they want may then start seeing differences and increased memory utilization in places that they *don't* want that, all because we're scared that someone, somewhere, might see a regression due to HashAgg spilling to disk. If that increased memory footprint allows the planner to give me a better plan with faster execution and with no OOM I'd be very happy that this change happened. While having a more flexible memory allocation framework is not a primary goal in and of itself it is a nice side-effect. I'm not going to say "let's only set work_mem to 32MB instead of 48MB so I can avoid this faster HashAgg node and instead execute a nested loop (or whatever)". More probable is the user whose current nested loop plan is fast enough and doesn't even realize that with a bit more memory they could get a HashAgg that performs 15% faster. For them this is a win on its face. I don't believe this negatively impacts the super-admin in our user-base and is a decent win for the average and below average admin. Do we really have an issue with plans being chosen while having access to more memory being slower than plans chosen while having less memory? The main risk here is that we choose for a user to consume more memory than they expected and they report OOM issues to us. We tell them to set this new GUC to 1.0. But that implies they are getting many non-HashAgg plans produced when with a bit more memory those HashAgg plans would have been chosen.
If they get those faster plans without OOM it's a win, if it OOMs it's a loss. I'm feeling optimistic here that we'll get considerably more wins than losses. How loss-averse do we need to be here though? Note we can give the upgrading user advance notice of our loss-aversion level and they can simply disagree and set it to 1.0 and/or perform more thorough testing. So being optimistic feels like the right choice. I don't know, but one of the main arguments against simply suggesting people to bump up work_mem (if they're hit by the hashagg spill in v13) was that it'd increase overall memory usage for them. It seems strange to then propose a new GUC set to a default that would result in higher memory usage *for everyone*. Of course, having such a GUC with a default a multiple of work_mem might be a win overall - or maybe not. I don't have a very good idea how many people will get bitten by this, and how many will get a speedup (and how significant the speedup would be). regards -- Tomas Vondra http://www.2ndQuadrant.com PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Re: Default setting for enable_hashagg_disk
On Sat, Jul 11, 2020 at 4:23 PM Tomas Vondra wrote: > I find that example rather suspicious. I mean, what exactly in the > GroupAgg plan would consume this memory? Surely it'd have to be some > node below the grouping, but sort shouldn't do that, no? > > Seems strange. Well, I imagine hash aggregate manages to use much less memory than the equivalent groupagg's sort, even though to the optimizer it appears as if hash agg should end up using more memory (which is not allowed by the optimizer when it exceeds work_mem, regardless of whether or not it's faster). It may also be relevant that Hash agg can use less memory simply by being faster. Going faster could easily reduce the memory usage for the system as a whole, even when you assume individual group agg nodes use more memory for as long as they run. So in-memory hash agg is effectively less memory hungry. It's not a great example of a specific case that we'd regress by not having hash_mem/hash_mem_multiplier. It's an overestimate where older releases accidentally got a bad, slow plan, not an underestimate where older releases "lived beyond their means but got away with it" by getting a good, fast plan. ISTM that the example is a good example of the strange dynamics involved. > I agree grouping estimates are often quite off, and I kinda agree with > introducing hash_mem (or at least with the concept that hashing is more > sensitive to the amount of memory than sort). Not sure it's the right escape > hatch to the hashagg spill problem, but maybe it is. The hash_mem/hash_mem_multiplier proposal aims to fix the problem directly, and not be an escape hatch, because we don't like escape hatches. I think that that probably fixes many or most of the problems in practice, at least assuming that the admin is willing to tune it. But a small number of remaining installations may still need a "true" escape hatch. There is an argument for having both, though I hope that the escape hatch can be avoided. -- Peter Geoghegan
Re: Default setting for enable_hashagg_disk
On Fri, Jul 10, 2020 at 10:00 PM David Rowley wrote: > hmm yeah. It's unfortunate, but I'm not sure how I'd have implemented > it differently. The problem is made worse by the fact that we'll only > release the memory for the hash table during ExecEndHashJoin(). If the > planner had some ability to provide the executor with knowledge that > the node would never be rescanned, then the executor could release the > memory for the hash table after the join is complete. For now, we'll > need to live with the fact that an Append containing many children > doing hash joins will mean holding onto all that memory until the > executor is shutdown :-( > > There's room to make improvements there, for sure, but not for PG13. I think that we're stuck with the idea that partitionwise join uses up to one work_mem allocation per partition until we deprecate work_mem as a concept. Anyway, I only talked about partitionwise join because that was your example. I could just as easily have picked on parallel hash join instead, which is something that I was involved in myself (kind of). This is more or less a negative consequence of the incremental approach we have taken here, which is a collective failure. I have difficulty accepting that something like hash_mem_multiplier cannot be accepted because it risks making the consequence of questionable designs even worse. The problem remains that the original assumption just isn't very robust, and isn't something that the user has granular control over. In general it makes sense that a design in a stable branch is assumed to be the norm that new things need to respect, and not the other way around. But there must be some limit to how far that's taken. > It sounds interesting, but it also sounds like a new feature > post-beta. Perhaps it's better we minimise the scope of the change to > be a minimal fix just for the behaviour we predict some users might > not like. 
That's an understandable interpretation of the hash_mem/hash_mem_multiplier proposal on the table, and yet one that I disagree with. I consider it highly desirable to have a GUC that can be tuned in a generic and high level way, on general principle. We don't really do escape hatches, and I'd rather avoid adding one now (though it's far preferable to doing nothing, which I consider totally out of the question). Pursuing what you called hashagg_mem is a compromise that will make neither of us happy. It seems like an escape hatch by another name. I would rather just go with your original proposal instead, especially if that's the only thing that'll resolve the problem in front of us. -- Peter Geoghegan
Re: Default setting for enable_hashagg_disk
On Sat, Jul 11, 2020 at 09:49:43AM -0700, Peter Geoghegan wrote: On Sat, Jul 11, 2020 at 7:22 AM Stephen Frost wrote: * Tom Lane (t...@sss.pgh.pa.us) wrote: > Have you got a better proposal that is reasonably implementable for v13? > (I do not accept the argument that "do nothing" is a better proposal.) So, no, I don't agree that 'do nothing' (except ripping out the one GUC that was already added) is a worse proposal than adding another work_mem like thing that's only for some node types. The question was "Have you got a better proposal that is reasonably implementable for v13?". This is anecdotal, but just today somebody on Twitter reported *increasing* work_mem to stop getting OOMs from group aggregate + sort: https://twitter.com/theDressler/status/1281942941133615104 It was possible to fix the problem in this instance, since evidently there wasn't anything else that really did try to consume ~5 GB of work_mem memory. Evidently the memory isn't available in any general sense, so there are no OOMs now. Nevertheless, we can expect OOMs on this server just as soon as there is a real need to do a ~5GB sort, regardless of anything else. I find that example rather suspicious. I mean, what exactly in the GroupAgg plan would consume this memory? Surely it'd have to be some node below the grouping, but sort shouldn't do that, no? Seems strange. I don't think that this kind of perverse effect is uncommon. Hash aggregate can naturally be far faster than group agg + sort, Hash agg can naturally use a lot less memory in many cases, and we have every reason to think that grouping estimates are regularly totally wrong. You're significantly underestimating the risk. I agree grouping estimates are often quite off, and I kinda agree with introducing hash_mem (or at least with the concept that hashing is more sensitive to the amount of memory than sort). Not sure it's the right escape hatch to the hashagg spill problem, but maybe it is.
regards -- Tomas Vondra http://www.2ndQuadrant.com PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Re: Default setting for enable_hashagg_disk
On Sat, Jul 11, 2020 at 7:22 AM Stephen Frost wrote: > * Tom Lane (t...@sss.pgh.pa.us) wrote: > > Have you got a better proposal that is reasonably implementable for v13? > > (I do not accept the argument that "do nothing" is a better proposal.) > So, no, I don't agree that 'do nothing' (except ripping out the one GUC > that was already added) is a worse proposal than adding another work_mem > like thing that's only for some node types. The question was "Have you got a better proposal that is reasonably implementable for v13?". This is anecdotal, but just today somebody on Twitter reported *increasing* work_mem to stop getting OOMs from group aggregate + sort: https://twitter.com/theDressler/status/1281942941133615104 It was possible to fix the problem in this instance, since evidently there wasn't anything else that really did try to consume ~5 GB of work_mem memory. Evidently the memory isn't available in any general sense, so there are no OOMs now. Nevertheless, we can expect OOMs on this server just as soon as there is a real need to do a ~5GB sort, regardless of anything else. I don't think that this kind of perverse effect is uncommon. Hash aggregate can naturally be far faster than group agg + sort, Hash agg can naturally use a lot less memory in many cases, and we have every reason to think that grouping estimates are regularly totally wrong. You're significantly underestimating the risk. -- Peter Geoghegan
Re: Default setting for enable_hashagg_disk
On Sat, Jul 11, 2020 at 7:22 AM Stephen Frost wrote: > There now seem to be some suggestions that not only should we have a > new GUC, but we should default to having it not be equal to work_mem (or > 1.0 or whatever) and instead be higher, to be *twice* or larger whatever > the existing work_mem setting is- meaning that people whose systems are > working just fine and have good estimates that represent their workload > and who get the plans they want may then start seeing differences and > increased memory utilization in places that they *don't* want that, all > because we're scared that someone, somewhere, might see a regression due > to HashAgg spilling to disk. > If that increased memory footprint allows the planner to give me a better plan with faster execution and with no OOM I'd be very happy that this change happened. While having a more flexible memory allocation framework is not a primary goal in and of itself it is a nice side-effect. I'm not going to say "let's only set work_mem to 32MB instead of 48MB so I can avoid this faster HashAgg node and instead execute a nested loop (or whatever)". More probable is the user whose current nested loop plan is fast enough and doesn't even realize that with a bit more memory they could get a HashAgg that performs 15% faster. For them this is a win on its face. I don't believe this negatively impacts the super-admin in our user-base and is a decent win for the average and below average admin. Do we really have an issue with plans being chosen while having access to more memory being slower than plans chosen while having less memory? The main risk here is that we choose for a user to consume more memory than they expected and they report OOM issues to us. We tell them to set this new GUC to 1.0. But that implies they are getting many non-HashAgg plans produced when with a bit more memory those HashAgg plans would have been chosen. If they get those faster plans without OOM it's a win, if it OOMs it's a loss.
I'm feeling optimistic here that we'll get considerably more wins than losses. How loss-averse do we need to be here though? Note we can give the upgrading user advance notice of our loss-aversion level and they can simply disagree and set it to 1.0 and/or perform more thorough testing. So being optimistic feels like the right choice. David J.
Re: Default setting for enable_hashagg_disk
David Rowley writes: > hmm yeah. It's unfortunate, but I'm not sure how I'd have implemented > it differently. The problem is made worse by the fact that we'll only > release the memory for the hash table during ExecEndHashJoin(). If the > planner had some ability to provide the executor with knowledge that > the node would never be rescanned, then the executor could release the > memory for the hash table after the join is complete. EXEC_FLAG_REWIND seems to fit the bill already? regards, tom lane
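[Editor's note: the early-release idea discussed here can be modeled in a few lines. This is a conceptual Python sketch, not PostgreSQL C: EXEC_FLAG_REWIND is a real executor flag mentioned above, but the class, method names, and the flag's numeric value are invented for illustration.]

```python
# Conceptual sketch: a hash join node could free its hash table as soon as
# the join finishes, provided the executor was told it will never be
# rescanned. In PostgreSQL that hint would come via eflags at node init;
# here the flag value and the HashJoinState stand-in are purely illustrative.

EXEC_FLAG_REWIND = 0x02  # caller may rescan the node from the start (value illustrative)

class HashJoinState:
    def __init__(self, eflags: int):
        self.eflags = eflags
        self.hash_table = {}  # stand-in for the in-memory hash table

    def finish_join(self):
        # If no rewind is possible, drop the hash table now instead of
        # holding it until executor shutdown (the complaint upthread).
        if not (self.eflags & EXEC_FLAG_REWIND):
            self.hash_table = None  # release memory early

node = HashJoinState(eflags=0)
node.finish_join()
assert node.hash_table is None  # freed early: no rescan expected

rescannable = HashJoinState(eflags=EXEC_FLAG_REWIND)
rescannable.finish_join()
assert rescannable.hash_table is not None  # kept: node may be rewound
```

The point of Tom's question is that this rewind hint already exists, so an Append over many hash joins need not pin every hash table until shutdown.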
Re: Default setting for enable_hashagg_disk
Greetings, * Tom Lane (t...@sss.pgh.pa.us) wrote: > Stephen Frost writes: > > I don't see hash_mem as being any kind of proper fix- it's just punting > > to the user saying "we can't figure this out, how about you do it" and, > > worse, it's in conflict with how we already ask the user that question. > > Turning it into a multiplier doesn't change that either. > > Have you got a better proposal that is reasonably implementable for v13? > (I do not accept the argument that "do nothing" is a better proposal.) > > I agree that hash_mem is a stopgap, whether it's a multiplier or no, > but at this point it seems difficult to avoid inventing a stopgap. > Getting rid of the process-global work_mem setting is a research project, > and one I wouldn't even count on having results from for v14. In the > meantime, it seems dead certain that there are applications for which > the current behavior will be problematic. hash_mem seems like a cleaner > and more useful stopgap than the "escape hatch" approach, at least to me. Have we heard from people running actual applications where there is a problem with raising work_mem to simply match what's already happening with the v12 behavior? Sure, there's been some examples on this thread of people who know the backend well showing how the default work_mem will cause the v13 HashAgg to spill to disk when given a query which has poor estimates, and that's slower than v12 where it ignored work_mem and used a bunch of memory, but it was also shown that raising work_mem addresses that issue and brings v12 and v13 back in line. There was a concern raised that other nodes might then use more memory- but there's nothing new there, if you wanted to avoid batching with a HashJoin in v12 you'd have exactly the same issue, and yet folks raise work_mem all the time to address this, and to get that HashAgg plan in the first place too when the estimates aren't so far off. 
There now seem to be some suggestions that not only should we have a new GUC, but we should default to having it not be equal to work_mem (or 1.0 or whatever) and instead be higher, to be *twice* or larger whatever the existing work_mem setting is- meaning that people whose systems are working just fine and have good estimates that represent their workload and who get the plans they want may then start seeing differences and increased memory utilization in places that they *don't* want that, all because we're scared that someone, somewhere, might see a regression due to HashAgg spilling to disk. So, no, I don't agree that 'do nothing' (except ripping out the one GUC that was already added) is a worse proposal than adding another work_mem like thing that's only for some node types. There's no way that we'd even be considering such an approach during the regular development cycle either- there would be calls for a proper holistic view, at least to the point where every node type that could possibly allocate a reasonable chunk of memory would be covered. Thanks, Stephen
Re: Default setting for enable_hashagg_disk
On Sat, 11 Jul 2020 at 14:02, Peter Geoghegan wrote: > > On Fri, Jul 10, 2020 at 6:19 PM David Rowley wrote: > > If we have to have a new GUC, my preference would be hashagg_mem, > > where -1 means use work_mem and a value between 64 and MAX_KILOBYTES > > would mean use that value. We'd need some sort of check hook to > > disallow 0-63. I really am just failing to comprehend why we're > > contemplating changing the behaviour of Hash Join here. > > I don't understand why partitionwise hash join consumes work_mem in > the way it does. I assume that the reason is something like "because > that behavior was the easiest to explain", or perhaps "people that use > partitioning ought to be able to tune their database well". Or even > "this design avoids an epic pgsql-hackers thread, because of course > every hash table should get its own work_mem". hmm yeah. It's unfortunate, but I'm not sure how I'd have implemented it differently. The problem is made worse by the fact that we'll only release the memory for the hash table during ExecEndHashJoin(). If the planner had some ability to provide the executor with knowledge that the node would never be rescanned, then the executor could release the memory for the hash table after the join is complete. For now, we'll need to live with the fact that an Append containing many children doing hash joins will mean holding onto all that memory until the executor is shutdown :-( There's room to make improvements there, for sure, but not for PG13. > > Of course, I > > understand that that node type also uses a hash table, but why does > > that give it the right to be involved in a change that we're making to > > try and give users the ability to avoid possible regressions with Hash > > Agg? > > It doesn't, exactly. The idea of hash_mem came from similar settings > in another database system that you'll have heard of, that affect all > nodes that use a hash table.
> I read about this long ago, and thought > that it might make sense to do something similar as a way to improve > work_mem. It sounds interesting, but it also sounds like a new feature post-beta. Perhaps it's better we minimise the scope of the change to be a minimal fix just for the behaviour we predict some users might not like. David
Re: Default setting for enable_hashagg_disk
On Fri, Jul 10, 2020 at 6:19 PM David Rowley wrote: > If we get hash_mem > or some variant that is a multiplier of work_mem, then that user is in > exactly the same situation for that plan. i.e. there's no ability to > increase the memory allowances for Hash Agg alone. That's true, of course. > If we have to have a new GUC, my preference would be hashagg_mem, > where -1 means use work_mem and a value between 64 and MAX_KILOBYTES > would mean use that value. We'd need some sort of check hook to > disallow 0-63. I really am just failing to comprehend why we're > contemplating changing the behaviour of Hash Join here. I don't understand why partitionwise hash join consumes work_mem in the way it does. I assume that the reason is something like "because that behavior was the easiest to explain", or perhaps "people that use partitioning ought to be able to tune their database well". Or even "this design avoids an epic pgsql-hackers thread, because of course every hash table should get its own work_mem". > Of course, I > understand that that node type also uses a hash table, but why does > that give it the right to be involved in a change that we're making to > try and give users the ability to avoid possible regressions with Hash > Agg? It doesn't, exactly. The idea of hash_mem came from similar settings in another database system that you'll have heard of, that affect all nodes that use a hash table. I read about this long ago, and thought that it might make sense to do something similar as a way to improve work_mem (without replacing it with something completely different to enable things like the "hash teams" design, which should be the long term goal). It's unusual that it took this hashaggs-that-spill issue to make the work_mem situation come to a head, and it's unusual that the proposal on the table doesn't just target hash agg. But it's not *that* unusual.
I believe that it makes sense on balance to lump together hash aggregate and hash join, with the expectation that the user might want to tune them for the system as a whole. This is not an escape hatch -- it's something that adds granularity to how work_mem can be tuned in a way that makes sense (but doesn't make perfect sense). It doesn't reflect reality, but I think that it comes closer to reflecting reality than other variations that I can think of, including your hashagg_mem compromise proposal (which is still much better than plain work_mem). In short, hash_mem is relatively conceptually clean, and doesn't unduly burden the user. I understand that you only want to add an escape hatch, which is what hashagg_mem still amounts to. There are negative consequences to the setting affecting hash join, which I am not unconcerned about. On the other hand, hashagg_mem is an escape hatch, and that's ugly in a way that hash_mem isn't. I'm also concerned about that. In the end, I think that the "hash_mem vs. hashagg_mem" question is fundamentally a matter of opinion. -- Peter Geoghegan
Re: Default setting for enable_hashagg_disk
On Fri, Jul 10, 2020 at 6:43 PM David Rowley wrote: > On Sat, 11 Jul 2020 at 13:36, David G. Johnston > wrote: > > If we add a setting that defaults to work_mem then the benefit is > severely reduced. You still have to modify individual queries, but the > change can simply be more targeted than changing work_mem alone. > > I think the idea is that this is an escape hatch to allow users to get > something closer to what PG12 did, but only if they really need it. I > can't quite understand why we need to leave the escape hatch open and > push them halfway through it. I find escape hatches are best left > closed until you really have no choice but to use them. > > The escape hatch dynamic is "the user senses a problem, goes into their query, and modifies some GUCs to make the problem go away". As a user I'd much rather have the odds of my needing to use that escape hatch reduced - especially if that reduction can be done without risk and without any action on my part. It's like having someone in a box right now, and then turning up the heat. We can give them an opening to get out of the box if they need it but we can also give them A/C. For some the A/C may be unnecessary but also not harmful; a smaller group will stay in the margin; and for the others it's not enough and they use the opening (which they would have done anyway without the A/C). David J.
Re: Default setting for enable_hashagg_disk
On Sat, 11 Jul 2020 at 13:36, David G. Johnston wrote: > If we add a setting that defaults to work_mem then the benefit is severely > reduced. You still have to modify individual queries, but the change can > simply be more targeted than changing work_mem alone. I think the idea is that this is an escape hatch to allow users to get something closer to what PG12 did, but only if they really need it. I can't quite understand why we need to leave the escape hatch open and push them halfway through it. I find escape hatches are best left closed until you really have no choice but to use them. David
Re: Default setting for enable_hashagg_disk
On Fri, Jul 10, 2020 at 6:19 PM David Rowley wrote: > If we have to have a new GUC, my preference would be hashagg_mem, > where -1 means use work_mem and a value between 64 and MAX_KILOBYTES > would mean use that value. We'd need some sort of check hook to > disallow 0-63. I really am just failing to comprehend why we're > contemplating changing the behaviour of Hash Join here. If we add a setting that defaults to work_mem then the benefit is severely reduced. You still have to modify individual queries, but the change can simply be more targeted than changing work_mem alone. I truly desire to have whatever we do provide that ability as well as a default value that is greater than the current work_mem value - which in v12 was being ignored and thus production usages saw memory consumption greater than work_mem. Only a multiplier does this. A multiplier-only solution fixes the problem at hand. A multiplier-or-memory solution adds complexity but provides flexibility. If adding that flexibility is straight-forward I don't see any serious downside other than the complexity of having the meaning of a single GUC's value dependent upon its magnitude. > Of course, I > understand that that node type also uses a hash table, but why does > that give it the right to be involved in a change that we're making to > try and give users the ability to avoid possible regressions with Hash > Agg? If Hash Join isn't affected by the "was allowed to use unlimited amounts of execution memory but now isn't" change then it probably should continue to consult work_mem instead of being changed to use the calculated value (work_mem x multiplier). David J.
Re: Default setting for enable_hashagg_disk
On Sat, 11 Jul 2020 at 12:47, David G. Johnston wrote: > The multiplier seems strictly better than "rely on work_mem alone, i.e., do > nothing"; the detracting factor being one more GUC. Even if one wants to > argue the solution is ugly or imperfect the current state seems worse and a > more perfect option doesn't seem worth waiting for. The multiplier won't > make every single upgrade a non-event but it provides a more than sufficient > amount of control and in the worst case can be effectively ignored by setting > it to 1.0. My argument wasn't related to whether the new GUC should be a multiplier of work_mem or an absolute amount of memory. The point I was trying to make was that the solution to add a GUC to allow users to increase the memory for Hash Join and Hash Agg for plans which don't contain any other node types that use work_mem is the same as doing nothing. As of today, those people could just increase work_mem. If we get hash_mem or some variant that is a multiplier of work_mem, then that user is in exactly the same situation for that plan. i.e. there's no ability to increase the memory allowances for Hash Agg alone. If we have to have a new GUC, my preference would be hashagg_mem, where -1 means use work_mem and a value between 64 and MAX_KILOBYTES would mean use that value. We'd need some sort of check hook to disallow 0-63. I really am just failing to comprehend why we're contemplating changing the behaviour of Hash Join here. Of course, I understand that that node type also uses a hash table, but why does that give it the right to be involved in a change that we're making to try and give users the ability to avoid possible regressions with Hash Agg? David
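[Editor's note: the hashagg_mem proposal above comes with a concrete validation rule: -1 means "use work_mem", values from 64 kB up to MAX_KILOBYTES are taken literally, and 0-63 are rejected by a check hook. A minimal Python sketch of that rule, under the assumption that MAX_KILOBYTES mirrors PostgreSQL's INT_MAX/1024 limit for kB-valued GUCs; the function names are invented.]

```python
# Sketch of the proposed hashagg_mem validation and resolution logic.
# Not PostgreSQL source; a real implementation would be a GUC check hook in C.

MAX_KILOBYTES = (2**31 - 1) // 1024  # assumed to mirror PostgreSQL's kB GUC limit

def check_hashagg_mem(value_kb: int) -> bool:
    """Return True if the setting would be accepted by the check hook."""
    if value_kb == -1:
        return True  # sentinel: fall back to work_mem
    return 64 <= value_kb <= MAX_KILOBYTES  # 0-63 (and below -1) are rejected

def effective_hashagg_mem(value_kb: int, work_mem_kb: int) -> int:
    """Resolve the memory limit Hash Agg would actually use."""
    if not check_hashagg_mem(value_kb):
        raise ValueError("hashagg_mem must be -1 or between 64 and MAX_KILOBYTES")
    return work_mem_kb if value_kb == -1 else value_kb

assert check_hashagg_mem(-1) and check_hashagg_mem(64)
assert not check_hashagg_mem(0) and not check_hashagg_mem(63)
assert effective_hashagg_mem(-1, 4096) == 4096        # default: use work_mem
assert effective_hashagg_mem(131072, 4096) == 131072  # explicit 128MB override
```

Note how this targets Hash Agg only, which is exactly the property David wants and the property Peter objects to: Hash Join would keep consulting plain work_mem.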
Re: Default setting for enable_hashagg_disk
On Fri, Jul 10, 2020 at 5:16 PM David Rowley wrote:
> Stephen mentions in [1] that:
> > Users who are actually hit by this in a negative way
> > have an option- increase work_mem to reflect what was actually happening
> > already.
>
> Peter is not a fan of that idea, which can only be due to the fact
> that it will also increase the maximum memory consumption allowed by
> other nodes in the plan too.

That isn't the only reason for me. The main advantage of hash_mem is that we get to set a default to some multiple greater than 1.0, so that an upgrade to v13 has a region where behavior similar to v12 is effectively maintained. I have no feel for whether that should be 2.0, 4.0, or something else, but 2.0 seemed small and I chose to use a power of 2.

> My concern is that if we do hash_mem and have that control the memory
> allowances for Hash Joins and Hash Aggs, then that solution is just as
> good as Stephen's idea when the plan only contains Hash Joins and Hash
> Aggs.
>
> As much as I do want to see us get something to allow users some
> reasonable way to get the same performance as they're used to, I'm
> concerned that giving users something that works for many of the use
> cases is not really going to be as good as giving them something that
> works in all their use cases. A user who has a partitioned table
> with a good number of partitions and partition-wise joins enabled
> might not like it if their Hash Join plan suddenly consumes hash_mem *
> nPartitions when they've set hash_mem to 10x of work_mem due to some
> other plan that requires that to maintain PG12's performance in PG13.

I don't know enough about the hash join dynamics to comment there, but if an admin changes the system default to 10x in lieu of a targeted fix for a query that actually needs ten times its current work_mem to work properly, I'd say that is a poor decision.
Absent hash_mem, they wouldn't raise work_mem system-wide to 10x its current value in order to upgrade to v13; they'd set work_mem for that query specifically. The same should happen here. Frankly, admins who are on top of their game, measuring and monitoring query performance and memory consumption, could operate in our "do nothing" mode by setting hash_mem's default to 1.0 and doling out memory via work_mem as they always have. Still, setting hash_mem to 10x for that single query would reduce their risk of OOM (none of the work_mem-consulting nodes would be increased), so having the GUC would be a net win should they avail themselves of it.

The multiplier seems strictly better than "rely on work_mem alone, i.e., do nothing"; the detracting factor is one more GUC. Even if one wants to argue the solution is ugly or imperfect, the current state seems worse, and a more perfect option doesn't seem worth waiting for. The multiplier won't make every single upgrade a non-event, but it provides more than sufficient control and in the worst case can be effectively ignored by setting it to 1.0.

Is there some reason to think that this multiplier with a conservative default of 2.0 would cause an actual problem - and would that scenario likely have caused an OOM anyway in v12? Given that "work_mem can be used many many times in a single query", I'm having trouble imagining such a problem.

David J.
Re: Default setting for enable_hashagg_disk
On Sat, 11 Jul 2020 at 10:00, Tom Lane wrote:
> Stephen Frost writes:
> > I don't see hash_mem as being any kind of proper fix- it's just punting
> > to the user saying "we can't figure this out, how about you do it" and,
> > worse, it's in conflict with how we already ask the user that question.
> > Turning it into a multiplier doesn't change that either.
>
> Have you got a better proposal that is reasonably implementable for v13?
> (I do not accept the argument that "do nothing" is a better proposal.)
>
> I agree that hash_mem is a stopgap, whether it's a multiplier or no,
> but at this point it seems difficult to avoid inventing a stopgap.
> Getting rid of the process-global work_mem setting is a research project,
> and one I wouldn't even count on having results from for v14. In the
> meantime, it seems dead certain that there are applications for which
> the current behavior will be problematic. hash_mem seems like a cleaner
> and more useful stopgap than the "escape hatch" approach, at least to me.

If we're going to end up going down the route of something like hash_mem for PG13, wouldn't it be better to have something more like hashagg_mem, which adjusts the memory limit for Hash Agg only?

Stephen mentions in [1] that:

> Users who are actually hit by this in a negative way
> have an option- increase work_mem to reflect what was actually happening
> already.

Peter is not a fan of that idea, which can only be because it will also increase the maximum memory consumption allowed by other nodes in the plan. My concern is that if we do hash_mem and have that control the memory allowances for Hash Joins and Hash Aggs, then that solution is just as good as Stephen's idea when the plan only contains Hash Joins and Hash Aggs.
As much as I do want to see us get something to allow users some reasonable way to get the same performance as they're used to, I'm concerned that giving users something that works for many of the use cases is not really going to be as good as giving them something that works in all their use cases. A user who has a partitioned table with a good number of partitions and partition-wise joins enabled might not like it if their Hash Join plan suddenly consumes hash_mem * nPartitions when they've set hash_mem to 10x work_mem due to some other plan that requires that to maintain PG12's performance in PG13.

If that user is unable to adjust hash_mem because of that other plan, then they're not going to be very satisfied that we've added hash_mem to allow their query to perform as well as it did in PG12. They'll be at the same OOM risk that they were exposed to in PG12 if they were to increase hash_mem here.

David

[1] https://www.postgresql.org/message-id/20200710143415.gj12...@tamriel.snowman.net
Re: Default setting for enable_hashagg_disk
On Fri, Jul 10, 2020 at 2:50 PM Stephen Frost wrote: > Nothing of what you've said thus far has shown me that there were > material bits of the discussion that I've missed. Maybe that's just because you missed those bits too? > No, that other people > feel differently or have made comments supporting one thing or another > isn't what I would consider material- I'm as allowed my opinions as much > as others, even when I disagree with the majority (or so claimed anyhow- > I've not gone back to count, but I don't claim it to be otherwise > either). You are of course entitled to your opinion. The problem we're trying to address here is paradoxical, in a certain sense. The HashAggs-that-spill patch is somehow not at fault on the one hand, but on the other hand has created this urgent need to ameliorate what is for all intents and purposes a regression. Everything is intertwined. Yes -- this *is* weird! And, I admit that the hash_mem proposal is unorthodox, even ugly -- in fact, I've said words to that effect on perhaps a dozen occasions at this point. This is also weird. I pointed out that my hash_mem proposal was popular because it seemed like it might save time. When I see somebody I know proposing something strange, my first thought is "why are they proposing that?". I might only realize some time later that there are special circumstances that make the proposal much more reasonable than it seemed at first (maybe even completely reasonable). There is no inherent reason why other people supporting the proposal makes it more valid, but in general it does suggest that special circumstances might apply. It guides me in the direction of looking for and understanding what they might be sooner. -- Peter Geoghegan
Re: Default setting for enable_hashagg_disk
Stephen Frost writes: > I don't see hash_mem as being any kind of proper fix- it's just punting > to the user saying "we can't figure this out, how about you do it" and, > worse, it's in conflict with how we already ask the user that question. > Turning it into a multiplier doesn't change that either. Have you got a better proposal that is reasonably implementable for v13? (I do not accept the argument that "do nothing" is a better proposal.) I agree that hash_mem is a stopgap, whether it's a multiplier or no, but at this point it seems difficult to avoid inventing a stopgap. Getting rid of the process-global work_mem setting is a research project, and one I wouldn't even count on having results from for v14. In the meantime, it seems dead certain that there are applications for which the current behavior will be problematic. hash_mem seems like a cleaner and more useful stopgap than the "escape hatch" approach, at least to me. regards, tom lane