Re: [HACKERS] Using quicksort for every external sort run
On Thu, Apr 7, 2016 at 11:10 AM, Robert Haas wrote:
> I prefer units of tuples, with the GUC itself therefore being
> unitless. I suggest we call the parameter replacement_sort_threshold
> and document that (1) the ideal value may depend on the amount of CPU
> cache available to running processes, with more cache implying higher
> values; and (2) the ideal value may depend somewhat on the input data,
> with more correlation implying higher values. And then pick some
> value that you think is likely to work well for most people and call
> it good.
>
> If you could prepare a new patch with those changes and also making
> the changes requested in my other email, I will try to commit that
> before the deadline. Thanks.

Attached revision of patch series:

* Breaks out the parts you don't want to commit right now, as agreed.
These separate patches in the rebased patch series are included here
for completeness, but will probably be submitted separately to 9.7. I
do still think you should commit 0002-* alongside 0001-*, though,
because it's useful to be able to enable the memory context dumps on
dev builds to debug external sorting. I won't insist on it, but that
is my recommendation.

* Fixes the "over-zealous assertion" that I pointed out recently.

* Replaces the replacement_sort_mem GUC with a replacement_sort_tuples
GUC since, as discussed, effective cut-off points for using
replacement selection for the first run are easier to derive from the
size of memtuples (the might-be heap) than from
work_mem/maintenance_work_mem (the fraction of all tuplesort memory
that is used for memtuples can be very low in cases with what Tomas
called "padding"). Since you didn't get back to me on the name of the
GUC, I just ran with replacement_sort_tuples, but that's a name I'm
totally unattached to. Feel free to change it to whatever you prefer,
including your original suggestion of replacement_sort_threshold, if
you still think that works.
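The cut-off being described can be sketched as a bare tuple-count test. This is an illustration only, not PostgreSQL's actual code; the real useselection() decision in tuplesort.c involves more state, and the constant is just the default proposed in this thread:

```python
# Sketch of the replacement_sort_tuples cut-off: the decision is made on
# the number of tuples in memtuples, not on memory size, so wide
# ("padded") tuples cannot distort it. All names here are illustrative.
REPLACEMENT_SORT_TUPLES = 150000  # proposed default from this thread

def use_replacement_selection(memtupcount,
                              threshold=REPLACEMENT_SORT_TUPLES):
    """Use replacement selection for the first run only while the
    would-be heap is small enough to plausibly stay cache-resident."""
    return memtupcount <= threshold
```

The point of counting tuples rather than kilobytes is exactly the "padding" problem discussed below: a wide row inflates memory use without making the heap any deeper.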
The new default value that I came up with for replacement_sort_tuples
is 150,000 tuples, which is intended as a rough generic break-even
point. Note that trace_sort reports how many tuples were in the heap
should replacement selection actually be chosen for the first run.
150,000 seems like a high enough generic delta between an out-of-order
tuple and its optimal in-order position; if *that* amount of buffer
space to "juggle" tuples isn't enough, it seems unlikely that
*anything* will be (anything that is less than 1/2 of the total number
of input tuples, at least).

Note that I use the term "cache oblivious" in the documentation now,
per your suggestion that CPU cache characteristics be addressed. We
have traditionally avoided jargon like that, but I think it works well
here. The reader is not required to know the definition. Dropping that
term provides bread-crumbs for advanced users to put all this together
in more detail, which I believe has value. It suggests that increasing
work_mem or maintenance_work_mem can have almost no downside provided
you don't need that memory for anything else, which is true.

I will be glad to see this through. Thanks for your help with this, Robert.

--
Peter Geoghegan

From 167f9ef6720f87925b67d995a9e7bdf87e72767c Mon Sep 17 00:00:00 2001
From: Peter Geoghegan
Date: Sun, 12 Jul 2015 13:14:01 -0700
Subject: [PATCH 5/5] Perform memory prefetching when writing memtuples

This patch is based on, but quite distinct from, a separately
submitted, more general version which performs prefetching in several
places [1]. This version now only performs prefetching of each "tuple
proper" during the writing of batches of tuples (an entire run,
written following a quicksort). The case for prefetching each "tuple
proper" at several sites now seems weak due to differences in CPU
microarchitecture.
However, it might still be that there is a consistent improvement
observable when writing out tuples, because that involves a
particularly tight inner loop, with relatively predictable processing
to hide memory latency behind. A helpful generic prefetch hint may be
possible for this case, even if it proves impossible elsewhere. This
has been shown to appreciably help on both a POWER7 server processor
[2], and an Intel mobile processor.

[1] https://commitfest.postgresql.org/6/305/
[2] CAM3SWZR5rv3+F3FOKf35=dti7otmmcdfoe2vogur0pddg3j...@mail.gmail.com
---
 config/c-compiler.m4               | 17 +
 configure                          | 31 +++
 configure.in                       |  1 +
 src/backend/utils/sort/tuplesort.c | 14 ++
 src/include/c.h                    | 14 ++
 src/include/pg_config.h.in         |  3 +++
 src/include/pg_config.h.win32      |  3 +++
 src/include/pg_config_manual.h     | 10 ++
 8 files changed, 93 insertions(+)

diff --git a/config/c-compiler.m4 b/config/c-compiler.m4
index a7f6773..c181496 100644
--- a/config/c-compiler.m4
+++ b/config/c-compiler.m4
@@ -271,
Re: [HACKERS] Using quicksort for every external sort run
On Thu, Apr 7, 2016 at 11:10 AM, Robert Haas wrote:
> I prefer units of tuples, with the GUC itself therefore being
> unitless. I suggest we call the parameter replacement_sort_threshold
> and document that (1) the ideal value may depend on the amount of CPU
> cache available to running processes, with more cache implying higher
> values; and (2) the ideal value may depend somewhat on the input data,
> with more correlation implying higher values. And then pick some
> value that you think is likely to work well for most people and call
> it good.

I really don't want to bikeshed about this, but I must ask: if the
name of the GUC must include the word "threshold", shouldn't it be
called quicksort_threshold? My dictionary defines threshold as "any
place or point of entering or beginning". But this GUC does not govern
where replacement selection begins; it governs where it ends.

How do you feel about replacement_sort_tuples? We already use the word
"tuple" in the names of GUCs.

--
Peter Geoghegan

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Using quicksort for every external sort run
On Thu, Apr 7, 2016 at 11:05 AM, Robert Haas wrote:
> I spent some time today reading through the new 0001 and in general I
> think it looks pretty good.

Cool.

> 1. Changing cost_sort to consider disk access as 90% sequential, 10%
> random rather than 75% sequential, 25% random. As far as I can recall
> from the thread, zero test results have been posted to demonstrate
> that this is a good idea. It also seems completely unprincipled.

I think that it's less unprincipled than the existing behavior, which
imagines that I/O is a significant cost overall, something that is
demonstrably wrong (there is an XXX comment about the existing disk
access costings). Still, I agree that there is no logical reason to
connect it to the bulk of what I want to do here, except that maybe it
would be good if we were more optimistic about the cost of external
sorting now. cost_sort() knows nothing about cache efficiency, of
course, so naturally we cannot teach it to weigh cache efficiency less
heavily. I guess I was worried that the smaller run sizes would put
cost_sort() off external sorts even more, even as they became far
cheaper.

> 2. Restricting the maximum number of tapes to 500. This seems like a
> sound change and I don't object to it in theory. But I've seen no
> benchmark results which demonstrate that this is a good idea, and it
> is quite separate from the core purpose of the patch.

Ditto. This is something that could be done separately. We've often
pondered if it made any sense at all (e.g. the commit message of
c65ab0bfa97b71bceae6402498910f4074996279), and I'm sure that it
doesn't, but the memory refund stuff in the already-committed memory
management patch at least refunds the cost for the final on-the-fly
merge (iff state->tuples).

> Since time is short, I recommend we remove both of these things from
> the patch and you can resubmit them as separate patches later.
> As far as I can see, neither of them is so tied into the rest of the
> patch that the main part of the patch can't be committed without
> those changes.

I agree to all this. Now that you've indicated where you stand on
replacement_sort_mem, I have all the information I need to produce a
new revision. I'll go do that. Thanks.

--
Peter Geoghegan
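For readers following the tape-cap discussion, the interaction between the number of tapes and the number of merge passes can be sketched with a simplified balanced-merge estimate. PostgreSQL actually uses polyphase merge, so treat this purely as an approximation, not the planner's arithmetic:

```python
import math

def merge_passes(num_runs, num_tapes):
    """Rough balanced-merge estimate: each pass merges up to
    (num_tapes - 1) runs into one, so the pass count grows with
    log base (num_tapes - 1) of the initial run count."""
    if num_runs <= 1:
        return 0
    return math.ceil(math.log(num_runs, num_tapes - 1))
```

Under this approximation, even a quarter of a million initial runs still merge in three passes with a 500-tape cap, which is why capping tapes can cost little while freeing memory otherwise consumed by per-tape buffers.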
Re: [HACKERS] Using quicksort for every external sort run
On Thu, Apr 7, 2016 at 1:17 PM, Peter Geoghegan wrote:
>> I certainly agree that GUCs that aren't easy to tune are bad. I'm
>> wondering whether the fact that this one is hard to tune is something
>> that can be fixed. The comments about "padding" - a term I don't
>> like, because it to me implies a deliberate attempt to game the
>> benchmark when in reality wanting to sort a wide row is entirely
>> reasonable - make me wonder if this should be based on a number of
>> tuples rather than an amount of memory. If considering the row width
>> makes us get the wrong answer, then let's not do that.
>
> That's a good point. While I don't think it will make it easy to tune
> the GUC, it will make it easier. Although, I think that it should
> probably still be GUC_UNIT_KB. That should just be something that my
> useselection() function compares to the overall size of memtuples
> alone when we must initially spill, not the value of
> work_mem/maintenance_work_mem. The degree of padding isn't entirely
> irrelevant, because not all comparisons will be resolved at the
> stup.datum1 level, but it's still clearly an improvement to not have
> wide tuples mess with things.
>
> Would you like me to revise the patch along those lines? Or, do you
> prefer units of tuples? Tuples are basically equivalent, but make it
> way less obvious what the relationship with CPU cache might be. If I
> revise the patch along these lines, I should also reduce the default
> replacement_sort_mem to produce roughly equivalent behavior for
> non-padded cases.

I prefer units of tuples, with the GUC itself therefore being
unitless. I suggest we call the parameter replacement_sort_threshold
and document that (1) the ideal value may depend on the amount of CPU
cache available to running processes, with more cache implying higher
values; and (2) the ideal value may depend somewhat on the input data,
with more correlation implying higher values.
And then pick some value that you think is likely to work well for
most people and call it good.

If you could prepare a new patch with those changes and also making
the changes requested in my other email, I will try to commit that
before the deadline. Thanks.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Re: [HACKERS] Using quicksort for every external sort run
On Mon, Mar 21, 2016 at 2:01 AM, Peter Geoghegan wrote:
> On Thu, Mar 17, 2016 at 1:13 PM, Robert Haas wrote:
>> OK, I have now committed 0001
>
> I attach a revision of the external quicksort patch and supplementary
> small patches, rebased on top of the master branch.

I spent some time today reading through the new 0001 and in general I
think it looks pretty good. But I think that there is some stuff in
there that logically seems to me to deserve to be separate patches. In
particular:

1. Changing cost_sort to consider disk access as 90% sequential, 10%
random rather than 75% sequential, 25% random. As far as I can recall
from the thread, zero test results have been posted to demonstrate
that this is a good idea. It also seems completely unprincipled. If
the cost of sorts decreases as a result of this patch, it is because
we've reduced the CPU cost, not the I/O cost. The changes we're
talking about here make I/O more random, not less random, because we
will now have more tapes, not fewer, which means merges will have to
seek the disk head more frequently, not less frequently.

Now, it's tempting to say that this patch should result in some change
to the cost model: if the patch doesn't make sorting faster, we
shouldn't commit it at all, and if it does, then surely the cost model
should change accordingly. But the question for the cost model isn't
whether the change to the model somehow reflects the increase in
execution speed. It's whether we get better query plans with the
change than without. I don't think there's been a degree of review of
that aspect of this patch on-list that would give me confidence to
commit a change like this.

2. Restricting the maximum number of tapes to 500. This seems like a
sound change and I don't object to it in theory. But I've seen no
benchmark results which demonstrate that this is a good idea, and it
is quite separate from the core purpose of the patch.
Since time is short, I recommend we remove both of these things from
the patch and you can resubmit them as separate patches later. As far
as I can see, neither of them is so tied into the rest of the patch
that the main part of the patch can't be committed without those
changes.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
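As a toy illustration of the disputed 90%/10% vs. 75%/25% change: this is not the actual cost_sort() formula, it merely shows how the assumed sequential fraction scales an I/O cost estimate, using the default seq_page_cost of 1.0 and random_page_cost of 4.0:

```python
def sort_io_cost(npages, seq_frac, seq_page_cost=1.0, random_page_cost=4.0):
    """Hypothetical I/O component of a sort cost estimate: a blend of
    sequential and random page fetches over npages pages."""
    rand_frac = 1.0 - seq_frac
    return npages * (seq_frac * seq_page_cost + rand_frac * random_page_cost)

old_mix = sort_io_cost(10000, 0.75)  # 75% sequential assumption
new_mix = sort_io_cost(10000, 0.90)  # 90% sequential: markedly cheaper
```

With default page costs the shift from 75% to 90% sequential cuts the estimated I/O term by roughly a quarter, which is why the change can alter plan choice even though it changes nothing at execution time.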
Re: [HACKERS] Using quicksort for every external sort run
On Thu, Apr 7, 2016 at 6:55 AM, Robert Haas wrote:
>> In reality there will be multiple processes running at the same time
>> (e.g. backends when running parallel query), significantly reducing
>> the amount of cache per process, making the replacement sort
>> inefficient and thus eliminating the regressions (by making the
>> master slower).
>
> Interesting point.

The effective use of CPU cache is *absolutely* critical here. I think
that this patch is valuable primarily because it makes sorting
predictable, and only secondarily because it makes it much faster.
Having discrete costs that can be modeled fairly accurately has
significant practical benefits for DBAs, and for query optimization,
especially when parallel worker sorts must be costed. Inefficient use
of CPU cache implies a big overall cost for the server, not just one
client; my sorting patches are usually tested on single client cases,
but the multi-client cases can be a lot more sympathetic (we saw this
with abbreviated keys at one point).

I wonder how many DBAs are put off by higher work_mem settings due to
issues with replacement selection; they are effectively denied the
ability to set work_mem appropriately across the board, because of
this one weak spot. It really is perverse that there is, in effect, a
"Blackjack" cost function for sorts, which runs counter to the general
intuition that more memory is better.

> I certainly agree that GUCs that aren't easy to tune are bad. I'm
> wondering whether the fact that this one is hard to tune is something
> that can be fixed. The comments about "padding" - a term I don't
> like, because it to me implies a deliberate attempt to game the
> benchmark when in reality wanting to sort a wide row is entirely
> reasonable - make me wonder if this should be based on a number of
> tuples rather than an amount of memory. If considering the row width
> makes us get the wrong answer, then let's not do that.

That's a good point.
While I don't think it will make it easy to tune the GUC, it will make
it easier. Although, I think that it should probably still be
GUC_UNIT_KB. That should just be something that my useselection()
function compares to the overall size of memtuples alone when we must
initially spill, not the value of work_mem/maintenance_work_mem. The
degree of padding isn't entirely irrelevant, because not all
comparisons will be resolved at the stup.datum1 level, but it's still
clearly an improvement to not have wide tuples mess with things.

Would you like me to revise the patch along those lines? Or, do you
prefer units of tuples? Tuples are basically equivalent, but make it
way less obvious what the relationship with CPU cache might be. If I
revise the patch along these lines, I should also reduce the default
replacement_sort_mem to produce roughly equivalent behavior for
non-padded cases.

--
Peter Geoghegan
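The "Blackjack" intuition mentioned above is easy to reproduce: replacement selection with a heap of M tuples yields runs of roughly 2M for random input, one giant run for already-sorted input, and runs of exactly M for reverse-sorted input, so adding memory can stop paying off abruptly. Below is a small simulation of the classic algorithm; it is illustrative only, not tuplesort.c's implementation:

```python
import heapq

def run_lengths(values, heap_size):
    """Simulate replacement selection and return the length of each run
    it would write out, given a heap holding heap_size tuples."""
    vals = list(values)
    heap = vals[:heap_size]
    heapq.heapify(heap)
    runs, cur, deferred = [], 0, []
    for v in vals[heap_size:]:
        out = heapq.heappop(heap)    # emit the current run's minimum
        cur += 1
        if v >= out:
            heapq.heappush(heap, v)  # still fits in the current run
        else:
            deferred.append(v)       # must wait for the next run
        if not heap:                 # current run exhausted
            runs.append(cur)
            cur = 0
            heap, deferred = deferred, []
            heapq.heapify(heap)
    cur += len(heap)                 # drain the final heap
    if cur:
        runs.append(cur)
    if deferred:
        runs.append(len(deferred))
    return runs
```

For ascending input the whole file becomes a single run; for descending input every run is exactly heap_size, the degenerate case where the heap buys nothing over quicksorting each work_mem-sized batch.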
Re: [HACKERS] Using quicksort for every external sort run
Sorry for not responding to this thread again sooner. I was on
vacation Thursday-Sunday, and have been playing catch-up since then.

On Sun, Apr 3, 2016 at 8:24 AM, Tomas Vondra wrote:
> Secondly, master is faster only if there's enough on-CPU cache for
> the replacement sort (for the memtuples heap), but the benchmark is
> not realistic in this respect as it only ran 1 query at a time, so it
> used the whole cache (6MB for i5, 12MB for Xeon).
>
> In reality there will be multiple processes running at the same time
> (e.g. backends when running parallel query), significantly reducing
> the amount of cache per process, making the replacement sort
> inefficient and thus eliminating the regressions (by making the
> master slower).

Interesting point.

> 3) replacement_sort_mem GUC
>
> I'm not quite sure what's the plan with this GUC. It was useful for
> development, but it seems to me it's pretty difficult to tune it in
> practice (especially if you don't know the internals, which users
> generally don't).
>
> The current patch includes the new GUC right next to work_mem, which
> seems rather unfortunate - I do expect users to simply mess with it
> assuming "more is better", which seems to be a rather poor idea.
>
> So I think we should either remove the GUC entirely, or move it to
> the developer section next to trace_sort (and removing it from the
> conf).

I certainly agree that GUCs that aren't easy to tune are bad. I'm
wondering whether the fact that this one is hard to tune is something
that can be fixed. The comments about "padding" - a term I don't like,
because it to me implies a deliberate attempt to game the benchmark
when in reality wanting to sort a wide row is entirely reasonable -
make me wonder if this should be based on a number of tuples rather
than an amount of memory. If considering the row width makes us get
the wrong answer, then let's not do that.
> BTW couldn't we tune the value automatically for each sort, using the
> pg_stats.correlation for the sort keys, when available (increasing
> the replacement_sort_mem when correlation is close to 1.0)? Wouldn't
> that improve at least some of the regressions?

Surely not for 9.6.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Re: [HACKERS] Using quicksort for every external sort run
On Sun, Apr 3, 2016 at 4:08 PM, Greg Stark wrote:
>> SELECT * FROM (SELECT * FROM (SELECT a FROM int_test UNION SELECT a
>> FROM int_test_padding) gg OFFSET 1e10) ff;
>
> ISTM OFFSET binds more loosely than UNION so these should be
> equivalent.

Not exactly:

postgres=# explain analyze select i from fff union select i from ggg offset 1e10;
                                                      QUERY PLAN
----------------------------------------------------------------------------------------------------------------------
 Limit  (cost=357771.51..357771.51 rows=1 width=4) (actual time=2989.378..2989.378 rows=0 loops=1)
   ->  Unique  (cost=345771.50..357771.51 rows=242 width=4) (actual time=2031.044..2930.903 rows=151 loops=1)
         ->  Sort  (cost=345771.50..351771.51 rows=242 width=4) (actual time=2031.042..2543.167 rows=242 loops=1)
               Sort Key: fff.i
               Sort Method: external merge  Disk: 32840kB
               ->  Append  (cost=0.00..58620.04 rows=242 width=4) (actual time=0.048..435.408 rows=242 loops=1)
                     ->  Seq Scan on fff  (cost=0.00..14425.01 rows=101 width=4) (actual time=0.048..100.435 rows=101 loops=1)
                     ->  Seq Scan on ggg  (cost=0.00..20195.01 rows=141 width=4) (actual time=0.042..138.991 rows=141 loops=1)
 Planning time: 0.123 ms
 Execution time: 2999.564 ms
(10 rows)

postgres=# explain analyze select * from (select i from fff union select i from ggg) fg offset 1e10;
                                                      QUERY PLAN
----------------------------------------------------------------------------------------------------------------------
 Limit  (cost=381771.53..381771.53 rows=1 width=4) (actual time=2982.519..2982.519 rows=0 loops=1)
   ->  Unique  (cost=345771.50..357771.51 rows=242 width=4) (actual time=2009.176..2922.874 rows=151 loops=1)
         ->  Sort  (cost=345771.50..351771.51 rows=242 width=4) (actual time=2009.174..2522.761 rows=242 loops=1)
               Sort Key: fff.i
               Sort Method: external merge  Disk: 32840kB
               ->  Append  (cost=0.00..58620.04 rows=242 width=4) (actual time=0.056..428.934 rows=242 loops=1)
                     ->  Seq Scan on fff  (cost=0.00..14425.01 rows=101 width=4) (actual time=0.055..100.806 rows=101 loops=1)
                     ->  Seq Scan on ggg  (cost=0.00..20195.01 rows=141 width=4) (actual time=0.042..139.994 rows=141 loops=1)
 Planning time: 0.127 ms
 Execution time: 2993.294 ms
(10 rows)

The startup and total costs are greater in the latter case, but the costs
match at and below the Unique node. Whether or not this was relevant
is probably unimportant, though. My habit is to do the offset outside
of the subquery.

My theory is that the master branch happened to get a HashAggregate
for the 128MB case that caused us both confusion, because it looked
cheaper than an external sort + unique when the sort required many
passes on the master branch only (where my cost_sort() changes that
lower the costing of external sorts were not included). This wasn't a
low cardinality case, so the HashAggregate may have only won by a
small amount. I suppose that this could happen when the HashAggregate
was not predicted to use memory > work_mem, but a sort was. Then, as
the sort requires fewer merge passes with more work_mem, the master
branch starts to agree with the patch on the cheapest plan once again.
The trend of the patch being faster continues, after this one hiccup.
This is down to the cost_sort() changes, not the tuplesort.c changes.
But this was just a quirk, and the trend still seems clear.

This theory seems very likely based on this strange query's numbers
for i5 master as work_mem increases:

Master: 16.711, 9.94, 4.891, 8.32, 4.88
Patch: 17.23, 9.77, 9.78, 4.95, 4.94

ISTM that master's last and third-from-last cases *both* use a
HashAggregate, where the patch behaves more consistently. After all,
the patch does smooth the cost function of sorting, a goal that is
independently useful beyond simply making sorting faster. We don't
have to be afraid of crossing an arbitrary, fuzzy threshold.

--
Peter Geoghegan
Re: [HACKERS] Using quicksort for every external sort run
On Sun, Apr 3, 2016 at 12:50 AM, Peter Geoghegan wrote:
> 1459308434.753 2016-03-30 05:27:14 CEST STATEMENT: SELECT * FROM
> (SELECT a FROM int_test UNION SELECT a FROM int_test_padding OFFSET
> 1e10) ff;
>
> I think that this is invalid, because the query was intended as this:
>
> SELECT * FROM (SELECT * FROM (SELECT a FROM int_test UNION SELECT a
> FROM int_test_padding) gg OFFSET 1e10) ff;

ISTM OFFSET binds more loosely than UNION so these should be
equivalent.

--
greg
Re: [HACKERS] Using quicksort for every external sort run
I just mean that, as you say, trace_sort is described in the
documentation. I don't think we'll end up with any kind of cost model
here, so where that would need to happen is only an academic matter.
The CREATE INDEX parameter would only be an option for the DBA.

That's about the only case I can see working for replacement
selection: where indexes can be created with very little memory
quickly, by optimistically starting to write out the start of the
final index representation almost immediately, before most of the
underlying table has even been read in.

--
Peter Geoghegan
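The "presorted" idea sketched here, writing the final representation immediately and erroring out the moment the input proves not to be presorted, can be illustrated as a streaming check. This is purely illustrative: "writing the index" is just an append below, and all names are invented rather than taken from the patch:

```python
def build_presorted(stream):
    """Optimistically build the final structure while the input stays
    sorted; raise the moment an out-of-order item appears. No temp-file
    run is ever written, which is the point of the idea."""
    index, prev = [], None
    for item in stream:
        if prev is not None and item < prev:
            raise ValueError("input is not presorted")
        index.append(item)  # stands in for writing the index itself
        prev = item
    return index
```

The appeal is that memory use stays tiny regardless of table size, because nothing is buffered beyond the previous key; the cost is that a single out-of-order tuple aborts the whole command.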
Re: [HACKERS] Using quicksort for every external sort run
On 04/03/2016 09:41 PM, Peter Geoghegan wrote:
> Hi Tomas,
>
> ...
>
>> 3) replacement_sort_mem GUC
>>
>> I'm not quite sure what's the plan with this GUC. It was useful for
>> development, but it seems to me it's pretty difficult to tune it in
>> practice (especially if you don't know the internals, which users
>> generally don't).
>
> I agree.
>
>> So I think we should either remove the GUC entirely, or move it to
>> the developer section next to trace_sort (and removing it from the
>> conf).
>
> I'll let Robert decide what's best here, but I see your point. Side
> note: trace_sort actually is documented. It's a bit weird that we
> have those TRACE_SORT macros at all IMV. I think we should rip those
> out, and assume every build enables TRACE_SORT, because that's
> probably true anyway.

What do you mean by documented? I thought this might be a good place:

http://www.postgresql.org/docs/devel/static/runtime-config-developer.html

which is where trace_sort is documented.

> I do think that replacement selection could be put to good use for
> CREATE INDEX if the CREATE INDEX utility command had a "presorted"
> parameter. Specifically, an implementation of the "presorted" idea
> that I recently sketched [1] could do better than any presorted
> replacement selection case we've seen so far because it allows the
> implementation to optimistically create the index on-the-fly (if that
> isn't possible, throw an error), without a second pass over tuples
> sorted on tape. Nothing needs to be stored on a tape/temp file *at
> all*; the only thing that is stored externally is the index itself.
> But this patch doesn't add that feature, which can be worked on
> without the user needing to know about replacement_sort_mem in 9.6.
> So, I'm not in favor of ripping out the replacement selection code,
> but think it could make sense to effectively disable it entirely for
> the time being (with some developer feature to turn it back on for
> testing). In general, I share your misgivings about the new GUC,
> though.

OK.
>> I'm wondering whether 16MB default is not a bit too much, actually.
>> As explained before, that's not the amount of cache we should expect
>> per process, so maybe ~2-4MB would be a better default value?
>
> The obvious presorted case is where we have a SERIAL column, but as I
> mentioned even that isn't helped by RS. Moreover, it will be
> significantly hurt with a default maintenance_work_mem of 64MB. Your
> int4 CREATE INDEX cases clearly show this.
>
>> BTW couldn't we tune the value automatically for each sort, using
>> the pg_stats.correlation for the sort keys, when available
>> (increasing the replacement_sort_mem when correlation is close to
>> 1.0)? Wouldn't that improve at least some of the regressions?
>
> Maybe, but that seems hard. That information isn't conveniently
> available to the executor/tuplesort, and as we've seen with CREATE
> INDEX int4 cases, it's far from clear that we'll win even when there
> definitely is presorted input. Replacement selection needs more than
> a simple correlation to win, so you'll end up building a cost model
> with many new problems if this is to work.

Sure, that's non-trivial and definitely not 9.6 material. I'm also
wondering whether we need to choose replacement_sort_mem at planning
time, or whether it could be done in the executor based on actually
observed data ...

regards

--
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Re: [HACKERS] Using quicksort for every external sort run
Hi Tomas,

Overall, I agree with your summary.

On Sun, Apr 3, 2016 at 5:24 AM, Tomas Vondra wrote:
> So, let me sum this up, the way I understand the current status.
>
> 1) overall, the patch seems to be a clear performance improvement

I think that's clear. There are even cases that are over 5x faster,
which are representative of some real workloads (e.g., "CREATE INDEX x
ON numeric_test (a)" when low_cardinality_almost_asc +
maintenance_work_mem=512MB). A lot of the aggregate (datum sort)
cases, and heap tuple cases, are 3x - 4x faster.

> 2) it's unlikely we can improve the performance further

I think it's very unlikely that these remaining regressions can be
fixed, yes.

> Secondly, master is faster only if there's enough on-CPU cache for
> the replacement sort (for the memtuples heap), but the benchmark is
> not realistic in this respect as it only ran 1 query at a time, so it
> used the whole cache (6MB for i5, 12MB for Xeon).
>
> In reality there will be multiple processes running at the same time
> (e.g. backends when running parallel query), significantly reducing
> the amount of cache per process, making the replacement sort
> inefficient and thus eliminating the regressions (by making the
> master slower).

Agreed. And even though the 8MB work_mem cases always have more than
enough CPU cache to fit the replacement selection heap, it's still no
worse than a mixed picture. The replacement_sort_mem=64KB + patch +
8MB (maintenance_)work_mem cases (i.e. replacement selection entirely
disabled) don't always do worse; they are often a draw, and sometimes
do much better. We *still* win in many cases, sometimes by quite a bit
(e.g. "SELECT COUNT(DISTINCT a) FROM int_test" typically loses about
50% of its runtime when patched and RS is disabled at work_mem=8MB).
The cases where we lose at work_mem=8MB involve padding and a
correlation.
The really important case of CREATE INDEX on int4 almost always wins,
*even with sorted input* (the almost-but-not-quite-asc-sorted case
loses ~1%). We can shave 20% - 30% off the CREATE INDEX int4 cases
with just maintenance_work_mem = 8MB. Even in these cases with so much
CPU cache relative to work_mem, you need to search for regressed cases
to find them, and they are less representative cases. So, while the
picture for the work_mem=8MB column alone seems kind of bad, if you
consider where the regressions actually occur, you could argue that
even that's a draw.

> 3) replacement_sort_mem GUC
>
> I'm not quite sure what's the plan with this GUC. It was useful for
> development, but it seems to me it's pretty difficult to tune it in
> practice (especially if you don't know the internals, which users
> generally don't).

I agree.

> So I think we should either remove the GUC entirely, or move it to
> the developer section next to trace_sort (and removing it from the
> conf).

I'll let Robert decide what's best here, but I see your point. Side
note: trace_sort actually is documented. It's a bit weird that we have
those TRACE_SORT macros at all IMV. I think we should rip those out,
and assume every build enables TRACE_SORT, because that's probably
true anyway.

I do think that replacement selection could be put to good use for
CREATE INDEX if the CREATE INDEX utility command had a "presorted"
parameter. Specifically, an implementation of the "presorted" idea
that I recently sketched [1] could do better than any presorted
replacement selection case we've seen so far because it allows the
implementation to optimistically create the index on-the-fly (if that
isn't possible, throw an error), without a second pass over tuples
sorted on tape. Nothing needs to be stored on a tape/temp file *at
all*; the only thing that is stored externally is the index itself.
But this patch doesn't add that feature, which can be worked on
without the user needing to know about replacement_sort_mem in 9.6.
So, I'm not in favor of ripping out the replacement selection code,
but think it could make sense to effectively disable it entirely for
the time being (with some developer feature to turn it back on for
testing). In general, I share your misgivings about the new GUC,
though.

> I'm wondering whether 16MB default is not a bit too much, actually.
> As explained before, that's not the amount of cache we should expect
> per process, so maybe ~2-4MB would be a better default value?

The obvious presorted case is where we have a SERIAL column, but as I
mentioned even that isn't helped by RS. Moreover, it will be
significantly hurt with a default maintenance_work_mem of 64MB. Your
int4 CREATE INDEX cases clearly show this.

> BTW couldn't we tune the value automatically for each sort, using the
> pg_stats.correlation for the sort keys, when available (increasing
> the replacement_sort_mem when correlation is close to 1.0)? Wouldn't
> that improve at least some of the regressions?

Maybe, but that seems hard. That information isn't conveniently
available to the executor/tuplesort, and as we've seen with C
Re: [HACKERS] Using quicksort for every external sort run
Hi, So, let me sum this up, the way I understand the current status.

1) overall, the patch seems to be a clear performance improvement

There are far more "green" cells than "red" ones in the spreadsheets, and the patch often shaves off 30-75% of the sort duration. Improvements are pretty much all over the board, for all data sets (low/high/unique cardinality, initial ordering) and data types.

2) it's unlikely we can improve the performance further

The regressions are limited to low work_mem settings, which we believe are not representative (or at least not as much as the higher work_mem values), for two main reasons. Firstly, if you need to sort a lot of data (e.g. 10M rows, as benchmarked), it's quite reasonable to use larger work_mem values. It'd be a bit backwards to reject a patch that gets you a 2-4x speedup with enough memory, on the grounds that it may have a negative impact with unreasonably small work_mem values. Secondly, master is faster only if there's enough on-CPU cache for the replacement sort (for the memtuples heap), but the benchmark is not realistic in this respect as it only ran one query at a time, so each query used the whole cache (6MB for the i5, 12MB for the Xeon). In reality there will be multiple processes running at the same time (e.g. backends when running parallel queries), significantly reducing the amount of cache per process, making the replacement sort inefficient and thus eliminating the regressions (by making master slower).

3) replacement_sort_mem GUC

I'm not quite sure what's the plan with this GUC. It was useful for development, but it seems to me it's pretty difficult to tune it in practice (especially if you don't know the internals, which users generally don't). The current patch includes the new GUC right next to work_mem, which seems rather unfortunate - I expect users to simply mess with it on the assumption that "more is better", which seems a rather poor idea.
So I think we should either remove the GUC entirely, or move it to the developer section next to trace_sort (and removing it from the conf). I'm wondering whether 16MB default is not a bit too much, actually. As explained before, that's not the amount of cache we should expect per process, so maybe ~2-4MB would be a better default value? Also, now that I'm re-reading the docs for the GUC, I realize it also depends on how the input data is correlated - that seems like a rather useless criterion for tuning, though, because that varies per sort node, so using that for a GUC value set in postgresql.conf does not seem very wise. Actually even on a per-query basis that's rather dubious, as it depends on how the sort node gets data (some nodes preserve ordering, some don't). BTW couldn't we tune the value automatically for each sort, using the pg_stats.correlation for the sort keys, when available (increasing the replacement_sort_mem when correlation is close to 1.0)? Wouldn't that improve at least some of the regressions? regards -- Tomas Vondra http://www.2ndQuadrant.com PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Using quicksort for every external sort run
On Sat, Apr 2, 2016 at 3:22 PM, Peter Geoghegan wrote: > On Sat, Apr 2, 2016 at 3:20 PM, Greg Stark wrote: >> There are also some weird cases in this list where there's a >> significant regression at 32MB but not at 8MB. I would like to see >> 16MB and perhaps 12MB and 24MB. They would help understand if these >> are just quirks or there's a consistent pattern. > > I'll need to drill down to trace_sort output to see what happened there. I looked into this. I too noticed that queries like "SELECT a FROM int_test UNION SELECT a FROM int_test_padding" looked strangely faster for 128MB + high_cardinality_almost_asc + i5 for master branch. This made the patch look relatively bad for the test with those exact properties only; the patch was faster with both lower and higher work_mem settings than 128MB. There was a weird spike in performance for the master branch only. Having drilled down to trace_sort output, I think I know roughly why. I see output like this: 1459308434.753 2016-03-30 05:27:14 CEST STATEMENT: SELECT * FROM (SELECT a FROM int_test UNION SELECT a FROM int_test_padding OFFSET 1e10) ff; I think that this is invalid, because the query was intended as this: SELECT * FROM (SELECT * FROM (SELECT a FROM int_test UNION SELECT a FROM int_test_padding) gg OFFSET 1e10) ff; This would have controlled for client overhead, per my request to Tomas, without altering the "underlying query" that you see in the final spreadsheet. I don't have an exact explanation for why you'd see this spike at 128MB for the master branch but not the other at the moment, but it seems like that one test is basically invalid, and should be discarded. I suspect that the patch didn't see its own similar spike due to my changes to cost_sort(), which reflected that sorts don't need to do so much expensive random I/O. This is the only case that I saw that was not more or less consistent with my expectations, which is good. 
-- Peter Geoghegan
Re: [HACKERS] Using quicksort for every external sort run
On Sat, Apr 2, 2016 at 7:31 AM, Tomas Vondra wrote: > So, I do have the results from both machines - I've attached the basic > comparison spreadsheets, the complete summary is available here: > >https://github.com/tvondra/sort-benchmark > > The database log also includes the logs for trace_sort=on for each query > (use the timestamp logged for each query in the spreadsheet to locate the > right section of the log). Thanks! Each row in these spreadsheets shows what looks like a multimodal distribution for the patch (if you focus on the actual run times, not the ratios). IOW, you can clearly see the regressions are only where master has its best case, and the patch its worst case; as the work_mem increases for each benchmark case for the patch, by far the largest improvement is usually seen as we cross the CPU cache threshold. Master gets noticeably slower as work_mem goes from 8MB to 32MB, but the patch gets far far faster. Things continue to improve for patched in absolute terms and especially relative to master following further increases in work_mem, but not nearly as dramatically as that first increment (unless we have lots of padding, which makes the memtuples heap itself much smaller, so it happens one step later). Master shows a slow decline at and past 32MB of work_mem. If the test hardware had a larger L3 cache, we might expect to notice a second big drop, but this hardware doesn't have the enormous L3 cache sizes of new Xeon processors (e.g. 32MB, 45MB). > While it might look like I'm somehow opposed to this patch series, that's > mostly because we tend to look only at the few cases that behave poorly. > > So let me be clear: I do think the patch seems to be a significant > performance improvement for most of the queries, and I'm OK with accepting a > few regressions (particularly if we agree those are pathological cases, > unlikely to happen in real-world workloads). 
> > It's quite rare that a patch is a universal win without regressions, so it's > important to consider how likely those regressions are and what's the net > effect of the patch - and the patch seems to be a significant improvement in > most cases (and regressions limited to pathological or rare corner cases). > > I don't think those are reasons not to push this into 9.6. I didn't think that you opposed the patch. In fact, you did the right thing by focussing on the low-end regressions, as I've said. I was probably too concerned about Robert failing to consider that they were not representative, particularly with regard to how small the memtuples heap could be relative to the CPU cache; blame it on how close I've become to this problem. I'm pretty confident that Robert can be convinced that these do not matter enough to not commit the patch. In any case, I'm pretty confident that I cannot fix any remaining regressions. > I haven't done any thorough investigation of the results yet, but in general > it seems the results from both machines are quite similar - the numbers are > different, but the speedup/slowdown patterns are mostly the same (with some > exceptions that I'd guess are due to HW differences). I agree. What we clearly see is the advantages of quicksort being cache oblivious, especially relative to master's use of a heap. That advantage becomes pronounced at slightly different points in each case, but the overall picture is the same. This pattern demonstrates why a cache oblivious algorithm is so useful in general -- we don't have to care about tuning for that. As important as this is for serial sorts, it's even more important for parallel sorts, where parallel workers compete for memory bandwidth, and where it's practically impossible to build a cost model for CPU cache size + memory use + nworkers. > The slowdown/speedup patterns (red/green cells in the spreadheets) are also > similar to those collected originally. 
Some timings are much lower, > presumably thanks to using the "OFFSET 1e10" pattern, but the patterns are > the same. I think it's notable that this made things more predictable, and made the benefits clearer. > The one thing that surprised me a bit is that > > replacement_sort_mem=64 > > actually often made the results considerably worse in many cases. A common > pattern is that the slowdown "spreads" to nearby cells - there are many > queries where the 8MB case is 1:1 with master and 32MB is 1.5:1 (i.e. takes > 1.5x more time), and setting replacement_sort_mem=64 just slows down the 8MB > case. > > In general, replacement_sort_mem=64 seems to only affect the 8MB case, and > in most cases it results in 100% slowdown (so 2x as long queries). To be clear, for the benefit of other people: replacement_sort_mem=64 makes the patch never use a replacement selection heap, even at the lowest tested work_mem setting of 8MB. This is exactly what I expected. When replacement_sort_mem is the proposed default of 16MB, it literally has zero impact on how the patch behaves where work_mem > replacement_sort_
Re: [HACKERS] Using quicksort for every external sort run
On Sat, Apr 2, 2016 at 3:20 PM, Greg Stark wrote: > These are the averages across all queries across all data sets for the > run-time for the patch versus master (not patched 64 which I think is > the replacement_sort_mem=64MB which appears to not be a win). So even > in the less successful cases on average quicksort is faster than > replacement selection. It's actually replacement_sort_mem=64 (64KB -- effectively disabled). So where that case does better or worse, which can only be when work_mem=8MB in practice, that's respectively good or bad for replacement selection. So, typically RS does better when there are presorted inputs with a positive (not inverse/DESC) correlation, and work_mem is low. As I've said, this is where the CPU cache is large enough to fit the entire memtuples heap. "Padded" cases are mostly bad because they make the memtuples heap relatively small in each case. So with work_mem=32MB, you get a memtuples heap structure similar to work_mem=8MB. The padding pushes things out a bit further, which favors master. -- Peter Geoghegan
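[Editor's note: replacement selection's sensitivity to input order can be illustrated with a toy run-formation model. This is a hedged sketch in Python, not tuplesort's actual implementation: with a heap of a given size, fully sorted input collapses into a single run, while random input produces runs averaging roughly twice the heap size, per Knuth's classic analysis.]

```python
import heapq
import random

def replacement_selection_run_lengths(values, heap_size):
    """Simulate run formation with replacement selection.

    Each heap entry is (run_number, value); an incoming value smaller
    than the one just emitted cannot join the current run, so it is
    tagged with the next run number and waits in the heap.
    """
    it = iter(values)
    heap = [(0, v) for _, v in zip(range(heap_size), it)]
    heapq.heapify(heap)
    run_lengths = []
    cur_run, cur_len = 0, 0
    while heap:
        run, v = heapq.heappop(heap)
        if run != cur_run:           # current run is exhausted
            run_lengths.append(cur_len)
            cur_run, cur_len = run, 0
        cur_len += 1                 # emit v to the current run
        try:
            nxt = next(it)
            heapq.heappush(heap, (run if nxt >= v else run + 1, nxt))
        except StopIteration:
            pass
    run_lengths.append(cur_len)
    return run_lengths

# Sorted input: one long run. Random input: runs of ~2x the heap size.
sorted_runs = replacement_selection_run_lengths(range(100000), 1000)
random.seed(42)
random_runs = replacement_selection_run_lengths(
    [random.random() for _ in range(100000)], 1000)
```

On sorted input the whole 100,000-element input comes out as a single run; on random input you get on the order of fifty runs. That single-long-run property is what made RS attractive in the tape era, and why its advantage today only survives when the heap fits in CPU cache.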
Re: [HACKERS] Using quicksort for every external sort run
On Sat, Apr 2, 2016 at 3:20 PM, Greg Stark wrote: > There are also some weird cases in this list where there's a > significant regression at 32MB but not at 8MB. I would like to see > 16MB and perhaps 12MB and 24MB. They would help understand if these > are just quirks or there's a consistent pattern. I'll need to drill down to trace_sort output to see what happened there. -- Peter Geoghegan
Re: [HACKERS] Using quicksort for every external sort run
On Sat, Apr 2, 2016 at 3:31 PM, Tomas Vondra wrote: > So let me be clear: I do think the patch seems to be a significant > performance improvement for most of the queries, and I'm OK with accepting a > few regressions (particularly if we agree those are pathological cases, > unlikely to happen in real-world workloads). The ultra-short version of this is:

8MB: 0.98
32MB: 0.79
128MB: 0.63
512MB: 0.51
1GB: 0.42

These are the averages across all queries across all data sets for the run-time for the patch versus master (not patched 64 which I think is the replacement_sort_mem=64MB which appears to not be a win). So even in the less successful cases on average quicksort is faster than replacement selection. But selecting just the cases where 8MB is significantly slower than master, it does look like the "padding" data sets are endemic. On the one hand that's a very realistic use-case where I think a lot of users find themselves. I know in my days as a web developer I typically threw a lot of columns into my queries, along with a lot of joins and ORDER BY clauses, and then left it to the application to pick through the recordsets that were returned for the columns that were of interest. The tuples being sorted were probably huge. On the other hand perhaps this is something better tackled by the planner. If the planner can arrange sorts to happen when the rows are narrower, that would be a bigger win than trying to move a lot of data around like this. (In the extreme, unnecessary columns could be replaced by the TID and refetched later, though that's obviously more than a little tricky to do effectively.) There are also some weird cases in this list where there's a significant regression at 32MB but not at 8MB. I would like to see 16MB and perhaps 12MB and 24MB. They would help understand if these are just quirks or there's a consistent pattern.
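[Editor's note: the padding effect can be made concrete with back-of-the-envelope arithmetic. This is a rough model with assumed, not measured, per-tuple sizes.]

```python
def tuples_fitting(work_mem_bytes, tuple_bytes, entry_overhead=24):
    # Rough model: each in-memory tuple costs its payload plus a fixed
    # per-entry overhead in the sort's tuple array (24 bytes is an
    # assumption here, not tuplesort's exact accounting).
    return work_mem_bytes // (tuple_bytes + entry_overhead)

work_mem = 8 * 1024 * 1024
narrow = tuples_fitting(work_mem, 16)   # e.g. a bare int4 row
padded = tuples_fitting(work_mem, 350)  # e.g. repeat(md5(a::text), 10)
```

The padded case fits roughly a tenth as many tuples, so at work_mem=32MB its heap looks like the narrow case's heap at 8MB, which is why padding "pushes things out" by about one work_mem step in the results.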
Re: [HACKERS] Using quicksort for every external sort run
On Thu, Feb 4, 2016 at 3:14 AM, Peter Geoghegan wrote: > Nyberg et al may have said it best in 1994, in the Alphasort Paper [1]: This paper is available from http://www.vldb.org/journal/VLDBJ4/P603.pdf (the previous link is now dead) > The paper also has very good analysis of the economics of sorting: > > "Even for surprisingly large sorts, it is economical to perform the > sort in one pass." I suggest taking a look at "Figure 2. Replacement-selection sort vs. QuickSort" in the paper. It confirms what I said recently about cache size. The diagram is annotated: "The tournament tree of replacement-selection sort at left has bad cache behavior, unless the entire tournament fits in cache". I think we're well justified in giving no weight at all to cases where the *entire* tournament tree (heap) fits in cache, because it's not economical to use a cpu-cache-sized work_mem setting. It simply makes no sense. I understand the reluctance to give up on replacement selection. The authors of this paper were themselves reluctant to do so. As they put it: """ We were reluctant to abandon replacement-selection sort, because it has stability and it generates long runs. Our first approach was to improve replacement-selection sort's cache locality. Standard replacement-selection sort has terrible cache behavior, unless the tournament fits in cache. The cache thrashes on the bottom levels of the tournament. If you think of the tournament as a tree, each replacement-selection step traverses a path from a pseudo-random leaf of the tree to the root. The upper parts of the tree may be cache resident, but the bulk of the tree is not. We investigated a replacement-selection sort that clusters tournament nodes so that most parent-child node pairs are contained in the same cache line. This technique reduces cache misses by a factor of two or three. Nevertheless, replacement-selection sort is still less attractive than QuickSort because: 1. 
The cache behavior demonstrates less locality than QuickSort's. Even when QuickSort runs did not fit entirely in cache, the average compare-exchange time did not increase significantly. 2. Tournament sort is more CPU-intensive than QuickSort. Knuth calculated a 2:1 ratio for the programs he wrote. We observed a 2.5:1 speed advantage for QuickSort over the best tournament sort we wrote. The key to achieving high execution speeds on fast processors is to minimize the number of references that cannot be serviced by the on-board cache (4MB in the case of the DEC 7000 AXP). As mentioned before, QuickSort's memory access patterns are sequential and, thus, have good cache behavior """ This paper is co-authored by Jim Gray, a Turing award laureate, as well as some other very notable researchers. The paper appeared in "Readings in Database Systems, 4th edition", which was edited by Joseph Hellerstein and Michael Stonebraker. These days, the cheapest consumer level CPUs have 4MB caches (in 1994, that was exceptional), so if this analysis wasn't totally justified in 1994, when the paper was written, it is today. I've spent a lot of time analyzing this problem. I've been looking at external sorting in detail for almost a year now. I've done my best to avoid any low-end regressions. I am very confident that I cannot do any better than I already have there, though. If various very influential figures in the database research community could not do better, then I have doubts that we can. I started with the intuition that we should still use replacement selection myself, but that just isn't well supported by benchmarking cases with sensible work_mem:cache size ratios. -- Peter Geoghegan
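[Editor's note: the ~2:1 CPU ratio Knuth calculated, quoted above, is easy to reproduce in miniature by counting comparisons. This is a rough sketch: Python's built-in sort is Timsort rather than quicksort, and heapq uses a bottom-up sift, so the measured gap is smaller than 2:1, but the direction holds.]

```python
import heapq
import random

class Counting:
    """Wraps a value and counts every '<' comparison made on it."""
    count = 0
    def __init__(self, v):
        self.v = v
    def __lt__(self, other):
        Counting.count += 1
        return self.v < other.v

def comparisons(sort_fn, values):
    # Wrap a fresh copy of the input, sort it, and report how many
    # comparisons the sort performed.
    Counting.count = 0
    sort_fn([Counting(v) for v in values])
    return Counting.count

def heap_sort(items):
    # Tournament-style sorting: every element is pushed through a
    # binary heap, as in replacement selection's inner loop.
    heapq.heapify(items)
    while items:
        heapq.heappop(items)

random.seed(1)
data = [random.random() for _ in range(5000)]
heap_cmps = comparisons(heap_sort, data)
lib_cmps = comparisons(list.sort, data)
# heap_cmps comes out noticeably larger than lib_cmps, echoing the
# extra comparison (and cache) work of tournament-style sorting.
```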
Re: [HACKERS] Using quicksort for every external sort run
On Wed, Mar 30, 2016 at 4:22 AM, Greg Stark wrote: > I'm sorry I was intending to run those benchmarks again this past week > but haven't gotten around to it. But my plan was to run them on a good > server I borrowed, an i7 with 8MB cache. I can still go ahead with > that but I can also try running it on the home server again too if you > want (an AMD N36L with 1MB cache). I don't want to suggest that people not test the very low end on very high end hardware. That's fine, as long as it's put in context. Considerations about the economics of cache sizes and work_mem settings are crucial to testing the patch objectively. If everything fits in cache anyway, then you almost eliminate the advantages quicksort has, but then you should be using an internal sort anyway. I think that this is just common sense. I would like to see a low-end benchmark for low-end work_mem settings too, though. Maybe you could repeat the benchmark I linked to, but with a recent version of the patch, including commit 0011c0091e886b. Compare that to the master branch just before 0011c0091e886b went in. I'm curious about how the more recent memory context resetting stuff that made it into 0011c0091e886b left us regression-wise. Tomas tested that, of course, but I have some concerns about how representative his numbers are at the low end. > But even for the smaller machines I don't think we should really be > caring about regressions in the 4-8MB work_mem range. Earlier in the > fuzzer work I was surprised to find out it can take tens of megabytes > to compile a single regular expression (iirc it was about 30MB for a > 64-bit machine) before you get errors. It seems surprising to me that > a single operator would consume more memory than an ORDER BY clause. I > was leaning towards suggesting we just bump up the default work_mem to > 8MB or 16MB. Today, it costs less than USD $40 for a new Raspberry Pi 2, which has 1GB of memory. 
I couldn't figure out exactly how much CPU cache that model has, but I'm pretty sure it's no more than 256KB. Memory just isn't that expensive; memory bandwidth is expensive. I agree that we could easily justify increasing work_mem to 8MB, or even 16MB. It seems almost silly to point it out, but: Increasing sort performance has the effect of decreasing the duration of sorts, which could effectively decrease memory use on the system. Increasing the memory available to sorts could decrease the overall use of memory. Being really frugal with memory is expensive, maybe even if your primary concern is the expense of memory usage, which it probably isn't these days. -- Peter Geoghegan
Re: [HACKERS] Using quicksort for every external sort run
On Wed, Mar 30, 2016 at 7:23 AM, Peter Geoghegan wrote: > Anyway, what I liked about Greg's approach to finding regressions at > the low end was that when testing, he used the cheapest possible VM > available on Google's cloud platform. When testing the low end, he had > low end hardware to go with the low end work_mem settings. This gave > the patch the benefit of using quicksort to make good use of what I > assume is a far smaller L2 cache; certainly nothing like 6MB or 12MB. > I think Greg might have used a home server to test my patch in [1], > actually, but I understand that it too was suitably low-end. I'm sorry I was intending to run those benchmarks again this past week but haven't gotten around to it. But my plan was to run them on a good server I borrowed, an i7 with 8MB cache. I can still go ahead with that but I can also try running it on the home server again too if you want (an AMD N36L with 1MB cache). But even for the smaller machines I don't think we should really be caring about regressions in the 4-8MB work_mem range. Earlier in the fuzzer work I was surprised to find out it can take tens of megabytes to compile a single regular expression (iirc it was about 30MB for a 64-bit machine) before you get errors. It seems surprising to me that a single operator would consume more memory than an ORDER BY clause. I was leaning towards suggesting we just bump up the default work_mem to 8MB or 16MB. -- greg
Re: [HACKERS] Using quicksort for every external sort run
On Tue, Mar 29, 2016 at 6:02 PM, Tomas Vondra wrote: > That may be easily due to differences between the CPUs and configuration. > For example the Xeon uses a way older CPU with different amounts of CPU > cache, and it's also a multi-socket system. And so on. So, having searched past threads I guess this was your Xeon E5450, which has a 12MB cache. I also see that you have an Intel Core i5-2500K Processor, which has 6MB of L2 cache. This hardware is mid-end, and the CPUs were discontinued in 2010 and 2013 respectively. Now, the i5 has a smaller L2 cache, so if anything I'd expect it to do worse than the Xeon, not better. But leaving that aside, I think there is an issue that we don't want to lose sight of. Which is: In most of the regressions we were discussing today, perhaps the entire heap structure can fit in L2 cache. This would be true for stuff like int4 CREATE INDEX builds, where a significant fraction of memory is used for IndexTuples, which most or all comparisons never actually need to read. This is the case with a CPU that was discontinued by the manufacturer just over 5 years ago. I think this is why "padding" cases can make the patch look not much better and occasionally worse at the low end: Those keep the number of memtuples as a fraction of work_mem very low, and so mask the problems with the replacement selection heap. When Greg Stark benchmarked the patch at the low end, to identify regressions, he did find some slight regressions at the lowest work_mem settings with many many passes, but they were quite small [1]. Greg also did some good analysis of the performance characteristics of external sorting today [2] that I recommend reading if you missed it. It's possible that those regressions have since been fixed, because Greg did not apply/test the memory batching patch that became commit 0011c0091e886b as part of this. It seems likely that it's at least partially fixed, and it might even be better than master overall, now. 
Anyway, what I liked about Greg's approach to finding regressions at the low end was that when testing, he used the cheapest possible VM available on Google's cloud platform. When testing the low end, he had low end hardware to go with the low end work_mem settings. This gave the patch the benefit of using quicksort to make good use of what I assume is a far smaller L2 cache; certainly nothing like 6MB or 12MB. I think Greg might have used a home server to test my patch in [1], actually, but I understand that it too was suitably low-end. It's perfectly valid to bring economics into this; typically, an external sort occurs only because memory isn't infinitely affordable, or it isn't worth provisioning enough memory to be totally confident that you can do every sort internally. With external sorting, the constant factors are what researchers generally spend most of the time worrying about. Knuth spends a lot of time discussing how the characteristics of actual magnetic tape drives changed throughout the 1970s in TAOCP Volume III. It's quite valid to ask if anyone would actually want to have an 8MB work_mem setting on a machine that has 12MB of L2 cache, cache that an external sort gets all to itself. Is that actually a practical setup that anyone would want to use? [1] http://www.postgresql.org/message-id/CAM-w4HOwt0C7ZndowHUuraw+xi+BhY5a6J008XoSq=r9z7h...@mail.gmail.com [2] http://www.postgresql.org/message-id/CAM-w4HM4XW3u5kVEuUrr+L+KX3WZ=5JKk0A=djjzypkb-hy...@mail.gmail.com -- Peter Geoghegan
Re: [HACKERS] Using quicksort for every external sort run
On Tue, Mar 29, 2016 at 6:02 PM, Tomas Vondra wrote: > And why not? I mean, why should it be acceptable to slow down? My point was that over 80% of execution time was spent in the HashAggregate, which outputs tuples to the sort. That, and the huge i5/Xeon inconsistency (in the extent to which this is regressed -- it's not at all, or it's regressed a lot) makes me suspicious that there is something else going on. Possibly involving the scheduling of I/O. > That may be easily due to differences between the CPUs and configuration. > For example the Xeon uses a way older CPU with different amounts of CPU > cache, and it's also a multi-socket system. And so on. We're talking about a huge relative difference with that HashAggregate plan, though. I don't think that those relative differences are explained by differing CPU characteristics. But I guess we'll find out soon enough. >> If there is ever a regression, it is only really sensible to talk >> about it while looking at trace_sort output (and, I guess, the query >> plan). I've asked Tomas for trace_sort output in all relevant cases. >> There is no point in "flying blind" and speculating what the problem >> was from a distance. > > > The updated benchmarks are currently running. I'm out of office until > Friday, and I'd like to process the results over the weekend. FWIW I'll have > results for these cases: > > 1) unpatched (a414d96a) > 2) patched, default settings > 3) patched, replacement_sort_mem=64 > > Also, I'll have trace_sort=on output for all the queries, so we can > investigate further. Thanks! That will tell us a lot more. > Yeah. That was one of the goals of the benchmark, to come up with some > tuning recommendations. On some systems significantly increasing memory GUCs > may not be possible, though - say, on very small systems with very limited > amounts of RAM. 
Fortunately, such systems will probably mostly use external sorts for CREATE INDEX cases, and there seems to be very little if any downside there, at least according to your similarly varied tests of CREATE INDEX. >> I don't think they are representative. Greg Stark characterized the >> regressions as being fairly limited, mostly at the very low end. And >> that was *before* all the memory fragmentation stuff made that >> better. I haven't done any analysis of how much better that made the >> problem *across the board* yet, but for int4 cases I could make 1MB >> work_mem queries faster with gigabytes of data on my laptop. I >> believe I tested various datum sort cases there, like "select >> count(distinct(foo)) from bar"; those are a very pure test of the >> patch. >> > > Well, I'd guess those conclusions may be a bit subjective. I think that the conclusion that we should do something or not do something based on this information is subjective. OTOH, whether and to what extent these tests are representative of real user workloads seems much less subjective. This is not a criticism of the test cases you came up with, which rightly emphasized possibly regressed cases. I think everyone already understood that the picture was very positive at the high end, in memory-rich environments. -- Peter Geoghegan
Re: [HACKERS] Using quicksort for every external sort run
Hi, On 03/29/2016 09:43 PM, Peter Geoghegan wrote: On Tue, Mar 29, 2016 at 9:11 AM, Robert Haas wrote: One test that kind of bothers me in particular is the "SELECT DISTINCT a FROM numeric_test ORDER BY a" test on the high_cardinality_random data set. That's a wash at most work_mem values, but at 32MB it's more than 3x slower. That's very strange, and there are a number of other results like that, where one particular work_mem value triggers a large regression. That's worrying. That case is totally invalid as a benchmark for this patch. Here is the query plan I get (doesn't matter if I run analyze) when I follow Tomas' high_cardinality_random 10M instructions (including setting work_mem to 32MB):

postgres=# explain analyze select distinct a from numeric_test order by a;
                                   QUERY PLAN
───────────────────────────────────────────────────────────────────────
 Sort  (cost=268895.39..270373.10 rows=591082 width=8) (actual time=3907.917..4086.174 rows=999879 loops=1)
   Sort Key: a
   Sort Method: external merge  Disk: 18536kB
   ->  HashAggregate  (cost=206320.50..212231.32 rows=591082 width=8) (actual time=3109.619..3387.599 rows=999879 loops=1)
         Group Key: a
         ->  Seq Scan on numeric_test  (cost=0.00..175844.40 rows=12190440 width=8) (actual time=0.025..601.295 rows=1000 loops=1)
 Planning time: 0.088 ms
 Execution time: 4120.656 ms
(8 rows)

Does that seem like a fair test of this patch? And why not? I mean, why should it be acceptable to slow down? I must also point out an inexplicable difference between the i5 and Xeon in relation to this query. It took about 10% less time on the patched Xeon 10M case, not ~200% more (line 53 of the summary page in each 10M case). So even if this case did exercise the patch well, it's far from clear that it has even been regressed at all. It's far easier to imagine that there was some problem with the i5 tests. That may be easily due to differences between the CPUs and configuration. For example the Xeon uses a way older CPU with different amounts of CPU cache, and it's also a multi-socket system. And so on. 
A complete do-over from Tomas would be best, here. He has already acknowledged that the i5 CREATE INDEX results were completely invalid. Pending a do-over from Tomas, I recommend ignoring the i5 tests completely. Also, I should once again point out that many of the work_mem cases actually had internal sorts at the high end, so the code in the patches simply wasn't exercised there at all (the 1024MB cases, where the numbers might be expected to get really good). If there is ever a regression, it is only really sensible to talk about it while looking at trace_sort output (and, I guess, the query plan). I've asked Tomas for trace_sort output in all relevant cases. There is no point in "flying blind" and speculating what the problem was from a distance. The updated benchmarks are currently running. I'm out of office until Friday, and I'd like to process the results over the weekend. FWIW I'll have results for these cases: 1) unpatched (a414d96a) 2) patched, default settings 3) patched, replacement_sort_mem=64 Also, I'll have trace_sort=on output for all the queries, so we can investigate further. Also, it's pretty clear that the patch has more large wins than it does large losses, but it seems pretty easy to imagine people who haven't tuned any GUCs writing in to say that 9.6 is way slower on their workload, because those people are going to be at work_mem=4MB, maintenance_work_mem=64MB. At those numbers, if Tomas's data is representative, it's not hard to imagine that the number of people who see a significant regression might be quite a bit larger than the number who see a significant speedup. Yeah. That was one of the goals of the benchmark, to come up with some tuning recommendations. On some systems significantly increasing memory GUCs may not be possible, though - say, on very small systems with very limited amounts of RAM. I don't think they are representative. 
Greg Stark characterized the regressions as being fairly limited, mostly at the very low end. And that was *before* all the memory fragmentation stuff made that better. I haven't done any analysis of how much better that made the problem *across the board* yet, but for int4 cases I could make 1MB work_mem queries faster with gigabytes of data on my laptop. I believe I tested various datum sort cases there, like "select count(distinct(foo)) from bar"; those are a very pure test of the patch. Well, I'd guess those conclusions may be a bit subjective. regards -- Tomas Vondra http://www.2ndQuadrant.com PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription
Re: [HACKERS] Using quicksort for every external sort run
On Tue, Mar 29, 2016 at 12:43 PM, Peter Geoghegan wrote:
> A complete do-over from Tomas would be best, here. He has already
> acknowledged that the i5 CREATE INDEX results were completely invalid.

The following analysis is all based on Xeon numbers, which as I've said we should focus on pending a do-over from Tomas. Especially important here is the largest set -- the 10M numbers from results-xeon-10m.ods.

I think that abbreviation distorts things here. We also see distortion from "padding" cases. Rather a lot of "padding" is used, FWIW. From Tomas' script:

INSERT INTO numeric_test_padding SELECT a, repeat(md5(a::text),10) FROM data_float ORDER BY a;

This makes the tests have TOAST overhead.

Some important observations on results-xeon-10m:

* There are almost no regressions for types that don't use abbreviation. There might be one exception when there is both padding and presorted input -- the 32MB high_cardinality_almost_asc/high_cardinality_sorted/unique_sorted "SELECT * FROM int_test_padding ORDER BY a", which takes 26% - 35% longer (those are all basically the same cases). But it's a big win in the high_cardinality_random, unique_random, and even unique_almost_asc categories, or when DESC order was requested in all categories (I note that there is certainly an emphasis on pre-sorted cases in the choice of categories). Other than that, no regressions from non-abbreviated types.

* No CREATE INDEX case is ever appreciably regressed, even with maintenance_work_mem at 8MB, 1/8 of its default value of 64MB. (Maybe we lose 1% - 3% with the other (results-xeon-1m.ods) cases, where maintenance_work_mem is close to or actually high enough to get an internal sort).
It's a bit odd that "CREATE INDEX x ON text_test_padding (a)" is about a wash for high_cardinality_almost_asc, but I think that's just because we're super I/O bound for this presorted case, but cannot make up for it with quicksort's "bubble sort best case" precheck for presortedness, so replacement selection does better in a way that might even result in a clean draw. CREATE INDEX looks very good in general. I think abbreviation might abort in one or two cases for text, but the picture for the patch is still solid.

* "Padding" can really distort low-end cases, which become more about moving big tuples around than actual sorting. If you really want to see how high_cardinality_almost_asc queries like "SELECT * FROM text_test_padding ORDER BY a" are testing the wrong thing, consider the best and worst case for the master branch with any amount of work_mem. The 10 million tuple high_cardinality_almost_asc case takes 40.16, 39.95, 40.98, 41.28, and 42.1 seconds for respective work_mems of 8MB, 32MB, 128MB, 512MB, and 1024MB. This is a very narrow case: it totally deemphasizes comparison cost and emphasizes moving tuples around, and it involves abbreviation of text plus a merge phase that cannot use abbreviation -- a merge phase that only the patch has, since master hits the replacement selection best case and makes one long run. The case is seriously short-changed by the memory batching refund thing in practice. When is *high cardinality text* (not dates or something) ever likely to be found in pre-sorted order for 10 million tuples in the real world? Besides, we just stopped trusting strxfrm(), so the case would probably be a wash now at worst.

* The more plausible padding + presorted + abbreviation case that is sometimes regressed is "SELECT * FROM numeric_test_padding ORDER BY a". But that's regressed a lot less than the aforementioned "SELECT * FROM text_test_padding ORDER BY a" case, and only at the low end. It is sometimes faster where the original case I mentioned is slower.
* Client overhead may distort things in the case of queries like "SELECT * FROM foo ORDER BY bar". This could be worse for the patch, which does relatively more computation during the final on-the-fly merge phase (which is great when you can overlap that with I/O; perhaps not when you get more icache misses with other computation). Aside from just adding a lot of noise, this could unfairly make the patch look a lot worse than master. Now, I'm not saying all of this doesn't matter. But these are all fairly narrow, pathological cases, often more about moving big tuples around (in memory and on disk) than about sorting. These regressions are well worth it. I don't think I can do any more than I already have to fix these cases; it may be impossible. It's a very difficult thing to come up with an algorithm that's unambiguously better in every possible case. I bent over backwards to fix low-end regressions already. In memory rich environments with lots of I/O bandwidth, I've seen this patch make CREATE INDEX ~2.5x faster for int4, on a logged table. More importantly, the patch makes setting maintenance_work_mem easy. Users' intuition for how sizing it ought to work now becomes more or less correct: In general, for each individual utility command bound by maintenance_work_mem, more memory is better. That
Re: [HACKERS] Using quicksort for every external sort run
On Tue, Mar 29, 2016 at 9:11 AM, Robert Haas wrote:
> One test that kind of bothers me in particular is the "SELECT DISTINCT
> a FROM numeric_test ORDER BY a" test on the high_cardinality_random
> data set. That's a wash at most work_mem values, but at 32MB it's
> more than 3x slower. That's very strange, and there are a number of
> other results like that, where one particular work_mem value triggers
> a large regression. That's worrying.

That case is totally invalid as a benchmark for this patch. Here is the query plan I get (doesn't matter if I run analyze) when I follow Tomas' high_cardinality_random 10M instructions (including setting work_mem to 32MB):

postgres=# explain analyze select distinct a from numeric_test order by a;
                                  QUERY PLAN
───
 Sort  (cost=268895.39..270373.10 rows=591082 width=8) (actual time=3907.917..4086.174 rows=999879 loops=1)
   Sort Key: a
   Sort Method: external merge  Disk: 18536kB
   ->  HashAggregate  (cost=206320.50..212231.32 rows=591082 width=8) (actual time=3109.619..3387.599 rows=999879 loops=1)
         Group Key: a
         ->  Seq Scan on numeric_test  (cost=0.00..175844.40 rows=12190440 width=8) (actual time=0.025..601.295 rows=1000 loops=1)
 Planning time: 0.088 ms
 Execution time: 4120.656 ms
(8 rows)

Does that seem like a fair test of this patch?

I must also point out an inexplicable difference between the i5 and Xeon in relation to this query. It took about 10% less time on the patched Xeon 10M case, not ~200% more (line 53 of the summary page in each 10M case). So even if this case did exercise the patch well, it's far from clear that it has even been regressed at all. It's far easier to imagine that there was some problem with the i5 tests.

A complete do-over from Tomas would be best, here. He has already acknowledged that the i5 CREATE INDEX results were completely invalid. Pending a do-over from Tomas, I recommend ignoring the i5 tests completely.
Also, I should once again point out that many of the work_mem cases actually had internal sorts at the high end, so the code in the patches simply wasn't exercised at all at the high end (the 1024MB cases, where the numbers might be expected to get really good). If there is ever a regression, it is only really sensible to talk about it while looking at trace_sort output (and, I guess, the query plan). I've asked Tomas for trace_sort output in all relevant cases. There is no point in "flying blind" and speculating about what the problem was from a distance.

> Also, it's pretty clear that the patch has more large wins than it
> does large losses, but it seems pretty easy to imagine people who
> haven't tuned any GUCs writing in to say that 9.6 is way slower on
> their workload, because those people are going to be at work_mem=4MB,
> maintenance_work_mem=64MB. At those numbers, if Tomas's data is
> representative, it's not hard to imagine that the number of people who
> see a significant regression might be quite a bit larger than the
> number who see a significant speedup.

I don't think they are representative. Greg Stark characterized the regressions as being fairly limited, mostly at the very low end. And that was *before* all the memory fragmentation stuff made that better. I haven't done any analysis of how much better that made the problem *across the board* yet, but for int4 cases I could make 1MB work_mem queries faster with gigabytes of data on my laptop. I believe I tested various datum sort cases there, like "select count(distinct(foo)) from bar"; those are a very pure test of the patch.

--
Peter Geoghegan

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Using quicksort for every external sort run
On Mon, Mar 28, 2016 at 11:18 PM, Peter Geoghegan wrote: > Note that amcheck V2, which I posted just now features tests for > external sorting. The way these work requires discussion. The tests > are motivated in part by the recent strxfrm() debacle, as well as by > the need to have at least some test coverage for this patch. It's bad > that external sorting currently has no test coverage. We should try > and do better there as part of this overhaul to tuplesort.c. Test coverage is good! However, I don't see that you've responded to Tomas Vondra's report of regressions. Maybe you're waiting for more data from him, but we're running out of time here. I think what we need to decide is whether these results are bad enough that the patch needs more work on the regressed cases, or whether we're comfortable with some regressions in low-memory configurations for the benefit of higher-memory configurations. I'm kind of on the fence about that, myself. One test that kind of bothers me in particular is the "SELECT DISTINCT a FROM numeric_test ORDER BY a" test on the high_cardinality_random data set. That's a wash at most work_mem values, but at 32MB it's more than 3x slower. That's very strange, and there are a number of other results like that, where one particular work_mem value triggers a large regression. That's worrying. Also, it's pretty clear that the patch has more large wins than it does large losses, but it seems pretty easy to imagine people who haven't tuned any GUCs writing in to say that 9.6 is way slower on their workload, because those people are going to be at work_mem=4MB, maintenance_work_mem=64MB. At those numbers, if Tomas's data is representative, it's not hard to imagine that the number of people who see a significant regression might be quite a bit larger than the number who see a significant speedup. On the whole, I'm tempted to say this needs more work before we commit to it, but I'd like to hear other opinions on that point. 
-- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Using quicksort for every external sort run
On Thu, Mar 10, 2016 at 6:54 PM, Peter Geoghegan wrote: > I've used amcheck [2] to test this latest revision -- the tool ought > to not see any problems with any index created with the patch applied. > Reviewers might find it helpful to use amcheck, too. As 9.6 is > stabilized, I anticipate that amcheck will give us a fighting chance > at early detection of any bugs that might have slipped into tuplesort, > or a B-Tree operator class. Since we still don't even have one single > test of the external sort code [3], it's just as well. If we wanted to > test external sorting, maybe we'd do that by adding tests to amcheck, > that are not run by default, much like test_decoding, which tests > logical decoding but is not targeted by "make installcheck"; that > would allow the tests to be fairly comprehensive without being > annoying. Using amcheck neatly side-steps issues with the portability > of "expected" pg_regress output when collatable type sorting is > tested. Note that amcheck V2, which I posted just now features tests for external sorting. The way these work requires discussion. The tests are motivated in part by the recent strxfrm() debacle, as well as by the need to have at least some test coverage for this patch. It's bad that external sorting currently has no test coverage. We should try and do better there as part of this overhaul to tuplesort.c. Thanks -- Peter Geoghegan -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Using quicksort for every external sort run
On Sun, Mar 20, 2016 at 11:01 PM, Peter Geoghegan wrote: > Allowing 0 tuple runs in rare cases seems like the simplest solution. > After all, mergeprereadone() is expressly prepared for 0 tuple runs. > It says "ensure that we have at least one tuple, if any are to be > had". There is no reason to assume that it says this only because it > imagines that no tuples might be found *only after* the first preread > for the merge (by which I mean I don't think that only applies when a > final on-the-fly merge reloads tuples from one particular tape > following running out of tuples of the tape/run in memory). I just realized that there is what amounts to an over-zealous assertion in dumpbatch(): > +* When this edge case hasn't occurred, the first memtuple should not > +* be found to be heapified (nor should any other memtuple). > +*/ > + Assert(state->memtupcount == 0 || > + state->memtuples[0].tupindex == HEAP_RUN_NEXT); The problem is that state->memtuples[0].tupindex won't have been *reliably* initialized here. We could make sure that it is for the benefit of this assertion, but I think it would be better to just remove the assertion, which isn't testing very much over and above the similar assertions that appears in the only dumpbatch() caller, dumptuples(). -- Peter Geoghegan -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Using quicksort for every external sort run
On Wed, Mar 23, 2016 at 8:05 PM, Tomas Vondra wrote: > FWIW, maintenance_work_mem was set to 1GB on the i5 machine and 256MB on the > Xeon. Hmm, maybe that's why we see no difference for CREATE INDEX on the i5, > and an improvement on the Xeon. That would explain it. > Not a big deal - it's easy enough to change the config and repeat the > benchmark. Are there any particular replacement_sort_mem values that you > think would be interesting to configure? I would start with replacement_sort_mem=64. i.e., 64KB, effectively disabled > I have to admit I'm a bit afraid we'll introduce a new GUC that only very > few users will know how to set properly, and so most people will run with > the default value or set it to something stupid. I agree. > Are you saying none of the queries triggers the memory context resets? What > queries would trigger that (to test the theory)? They will still do the context resetting and so on just the same, but would use a heap for the first attempt. But replacement_sort_mem=64 would let us know that > But if we care about 9.5 -> 9.6 regressions, then perhaps we should include > that commit into the benchmark, because that's what the users will see? Or > have I misunderstood the second part? I think it's good that you didn't test the March 17 commit of the memory batching to the master branch when testing the master branch. You should continue to do that, because we care about regressions against 9.5 only. The only issue insofar as what code was tested is that replacement_sort_mem was not set to 64 (to effectively disable any use of the heap by the patch). I would like to see if we can get rid of replacement_sort_mem without causing any real regressions, which I think the memory context reset stuff makes possible. There was a new version of my quicksort patch posted after March 17, but don't worry about it -- that's totally cosmetic. Some minor tweaks. > BTW which patch does the memory batching? 
A quick search through git log did > not return any recent patches mentioning these terms. Commit 0011c0091e886b874e485a46ff2c94222ffbf550. But, like I said, avoid changing what you're testing as master; do not include that. The patch set you were testing is fine. Nothing is missing. > As I mentioned above, I haven't realized work_mem does not matter for CREATE > INDEX, and maintenance_work_mem was set to a fixed value for the whole test. > And the two machines used different values for this particular configuration > value - Xeon used just 256MB, while i5 used 1GB. So while on i5 it was just > a single chunk, on Xeon there were multiple batches. Hence the different > behavior. Makes sense. Obviously this should be avoided, though. > No it doesn't add overhead. The script actually does > > COPY (query) TO '/dev/null' > > on the server for all queries (except for the CREATE INDEX, obviously), so > there should be pretty much no overhead due to transferring rows to the > client and so on. That still adds overhead, because the output functions are still used to create a textual representation of data. This was how Andres tested the improvement to the timestamptz output function committed to 9.6, for example. >> If you really wanted to make the patch look good, a sort with 5GB of >> work_mem is the best way, FWIW. The heap data structure used by the >> master branch tuplesort.c will handle that very badly. You use no >> temp_tablespaces here. I wonder if the patch would do better with >> that. Sorting can actually be quite I/O bound with the patch >> sometimes, where it's usually CPU/memory bound with the heap, >> especially with lots of work_mem. More importantly, it would be more >> informative if the temp_tablespace was not affected by I/O from >> Postgres' heap. > > > I'll consider testing that. However, I don't think there was any significant > I/O on the machines - particularly not on the Xeon, which has 16GB of RAM. 
> So the temp files should fit into that quite easily.

Right, but with a bigger sort, there might well be more I/O. Especially for the merge. It might be that that holds back the patch from doing even better than the master branch does.

> The i5 machine has only 8GB of RAM, but it has 6 SSD drives in raid0. So I
> doubt it was I/O bound.

These patches can sometimes be significantly I/O bound on my laptop, where that didn't happen before. Sounds unlikely here, though.

>> I also like seeing a sample of "trace_sort = on" output. I don't
>> expect you to carefully collect that in every case, but it can tell
>> us a lot about what's really going on when benchmarking.
>
> Sure, I can collect that.

Just for the interesting cases. Or maybe just dump it all and let me figure it out for myself. trace_sort output shows me how many runs there are, how abbreviation did, how memory was used, and even if the sort was I/O bound at various stages (it dumps some getrusage stats to the log, too). You can usually tell exactly what happened for external sorts, which is very informative.
Re: [HACKERS] Using quicksort for every external sort run
Hi,

On 03/24/2016 03:00 AM, Peter Geoghegan wrote:

On Tue, Mar 22, 2016 at 3:35 PM, Tomas Vondra wrote: I've tested the patch you've sent on 2016/3/11, which I believe is the last version. I haven't tuned the replacement_sort_mem at all, because my understanding was that it's not 9.6 material (per your message). So my intent was to test the configuration people are likely to use by default.

I meant that using replacement selection in a special way with CREATE INDEX was not 9.6 material. But replacement_sort_mem is. And so, any case with the (maintenance)_work_mem <= 16MB will have used a heap for the first run.

FWIW, maintenance_work_mem was set to 1GB on the i5 machine and 256MB on the Xeon. Hmm, maybe that's why we see no difference for CREATE INDEX on the i5, and an improvement on the Xeon.

I'm sorry I did not make a point of telling you this. It's my fault. The result in any case is that pre-sorted cases will be similar with and without the patch, since replacement selection can thereby make one long run. But on non-sorted cases, the patch helps less because it is in use less -- with not so much data overall, possibly much less (which I think explains why the 1M row tests seem so much less interesting than the 10M row tests).

Not a big deal - it's easy enough to change the config and repeat the benchmark. Are there any particular replacement_sort_mem values that you think would be interesting to configure?

I have to admit I'm a bit afraid we'll introduce a new GUC that only very few users will know how to set properly, and so most people will run with the default value or set it to something stupid.

I worry that at the low end, replacement_sort_mem makes the patch have one long run, but still some other runs, so merging is unbalanced. We should consider if the patch can beat the master branch at the low end without using a replacement selection heap.
It would do better in at least some cases in low memory conditions, possibly a convincing majority of cases. I had hoped that my recent idea (since committed) of resetting memory contexts would help a lot with regressions when work_mem is very low, and that particular theory isn't really tested here.

Are you saying none of the queries triggers the memory context resets? What queries would trigger that (to test the theory)? I'm not sure which commit you are referring to. The benchmark was done on a414d96a (from 2016/3/10). However I'd expect that to affect both sets of measurements, although it's possible that it affects the patched version differently.

You did test the right patches. It just so happens that the master branch now has the memory batching stuff, so it doesn't get credited with that. I think this is good, though, because we care about 9.5 -> 9.6 regressions.

So there's a commit in master (but not in 9.5), adding memory batching, but it got committed before a414d96a so the benchmark does not measure its impact (with respect to 9.5). Correct? But if we care about 9.5 -> 9.6 regressions, then perhaps we should include that commit into the benchmark, because that's what the users will see? Or have I misunderstood the second part? BTW which patch does the memory batching?

Improvement ratio (master time/patched time) for Xeon 10 million row case "SELECT * FROM int_test_padding ORDER BY a DESC": For work_mem of 8MB = 0.83, 32MB = 0.62, 128MB = 0.52, 512MB = 0.47, 1024MB = 1.00

So, it gets faster than the master branch as more memory is available, but then it goes to 1.00 -- a perfect draw. I think that this happened simply because at that point, the sort was an internal sort (even though similar CREATE INDEX case did not go internal at the same point). The (internal) 1024MB case is not that much faster than the 512MB external case, which is pretty good.

Indeed.
There are also "near draws", where the ratio is 0.98 or so. I think that this is because abbreviation is aborted, which can be a problem with synthetic data + text -- you get a very slow sort either way,

That is possible, yes. It's true that the worst regressions are on text, although there are a few on numeric too (albeit not as significant).

where most time is spent calling strcoll(), and cache characteristics matter much less. Those cases seemingly take much longer overall, so this theory makes sense. Unfortunately, abbreviated keys for text that is not C locale text were basically disabled across the board today due to a glibc problem. :-(

Yeah. Bummer :-(

Whenever I see that the patch is exactly as fast as the master branch, I am skeptical. I am particularly skeptical of all i5 results (including 10M cases), because the patch seems to be almost perfectly matched to the master branch for CREATE INDEX cases (which are the best cases for the patch on your Xeon server) -- it's much easier to believe that there was a problem during the test, honestly, like maintenance_work_mem wasn't set correctly.
Re: [HACKERS] Using quicksort for every external sort run
On Tue, Mar 22, 2016 at 2:27 PM, Tomas Vondra wrote: > For example these two queries got almost 2x as slow for some data sets: > > SELECT a FROM numeric_test UNION SELECT a FROM numeric_test_padding > SELECT a FROM text_test UNION SELECT a FROM text_test_padding > > I assume the slowdown is related to the batching (as it's only happening for > low work_mem values), so perhaps there's an internal heuristics that we > could tune? Can you show trace_sort output for these cases? Both master, and patched? Thanks -- Peter Geoghegan -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Using quicksort for every external sort run
On Tue, Mar 22, 2016 at 3:35 PM, Tomas Vondra wrote: > I've tested the patch you've sent on 2016/3/11, which I believe is the last > version. I haven't tuned the replacement_sort_mem at all. because my > understanding was that it's not a 9.6 material (per your message). So my > intent was to test the configuration people are likely to use by default. I meant that using replacement selection in a special way with CREATE INDEX was not 9.6 material. But replacement_sort_mem is. And so, any case with the (maintenance)_work_mem <= 16MB will have used a heap for the first run. I'm sorry I did not make a point of telling you this. It's my fault. The result in any case is that pre-sorted cases will be similar with and without the patch, since replacement selection can thereby make one long run. But on non-sorted cases, the patch helps less because it is in use less -- with not so much data overall, possibly much less (which I think explains why the 1M row tests seem so much less interesting than the 10M row tests). I worry that at the low end, replacement_sort_mem makes the patch have one long run, but still some more other runs, so merging is unbalanced. We should consider if the patch can beat the master branch at the low end without using a replacement selection heap. It would do better in at least some cases in low memory conditions, possibly a convincing majority of cases. I had hoped that my recent idea (since committed) of resetting memory contexts would help a lot with regressions when work_mem is very low, and that particular theory isn't really tested here. > I'm not sure which commit are you referring to. The benchmark was done on > a414d96a (from 2016/3/10). However I'd expect that to affect both sets of > measurements, although it's possible that it affects the patched version > differently. You did test the right patches. It just so happens that the master branch now has the memory batching stuff now, so it doesn't get credited with that. 
I think this is good, though, because we care about 9.5 -> 9.6 regressions. Improvement ratio (master time/patched time) for Xeon 10 million row case "SELECT * FROM int_test_padding ORDER BY a DESC": For work_mem of 8MB = 0.83, 32MB = 0.62, 128MB = 0.52, 512MB = 0.47, 1024MB = 1.00 So, it gets faster than the master branch as more memory is available, but then it goes to 1.00 -- a perfect draw. I think that this happened simply because at that point, the sort was an internal sort (even though similar CREATE INDEX case did not go internal at the same point). The (internal) 1024MB case is not that much faster than the 512MB external case, which is pretty good. There are also "near draws", where the ratio is 0.98 or so. I think that this is because abbreviation is aborted, which can be a problem with synthetic data + text -- you get a very slow sort either way, where most time is spent calling strcoll(), and cache characteristics matter much less. Those cases seemingly take much longer overall, so this theory makes sense. Unfortunately, abbreviated keys for text that is not C locale text was basically disabled across the board today due to a glibc problem. :-( Whenever I see that the patch is exactly as fast as the master branch, I am skeptical. I am particularly skeptical of all i5 results (including 10M cases), because the patch seems to be almost perfectly matched to the master branch for CREATE INDEX cases (which are the best cases for the patch on your Xeon server) -- it's much easier to believe that there was a problem during the test, honestly, like maintenance_work_mem wasn't set correctly. Those two things are so different that I have a hard time imagining that they'd ever really draw. I mean, it's possible, but it's more likely to be a problem with testing. And, queries like "SELECT * FROM int_test_padding ORDER BY a DESC" return all rows, which adds noise from all the client overhead. 
In fact, you often see that adding more memory helps no case here, so it seems a bit pointless. Maybe they should be written like "SELECT * FROM (select * from int_test_padding ORDER BY a DESC OFFSET 1e10) ff" instead. And maybe queries like "SELECT DISTINCT a FROM int_test ORDER BY a" would be better as "SELECT COUNT(DISTINCT a) FROM int_test", in order to test the datum/aggregate case. Just suggestions.

If you really wanted to make the patch look good, a sort with 5GB of work_mem is the best way, FWIW. The heap data structure used by the master branch tuplesort.c will handle that very badly. You use no temp_tablespaces here. I wonder if the patch would do better with that. Sorting can actually be quite I/O bound with the patch sometimes, where it's usually CPU/memory bound with the heap, especially with lots of work_mem. More importantly, it would be more informative if the temp_tablespace was not affected by I/O from Postgres' heap.

I also like seeing a sample of "trace_sort = on" output. I don't expect you to carefully collect that in every case, but it can tell us a lot about what's really going on when benchmarking.
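The run-formation tradeoff at the heart of this thread -- replacement selection producing a single long run on pre-sorted input, but runs only about twice the size of memory on random input -- can be sketched as follows. This is an illustrative model only, with invented names; it is not tuplesort.c's implementation (which deals in tapes, tuple headers, and memory accounting rather than Python lists):

```python
import heapq

def replacement_selection_runs(items, memory_slots):
    """Form sorted runs with a bounded min-heap (replacement selection).

    An incoming element smaller than the last value written out cannot
    join the current run, so it is set aside for the next one.
    """
    it = iter(items)
    # fill memory: the initial heap holds the first memory_slots elements
    heap = [x for _, x in zip(range(memory_slots), it)]
    heapq.heapify(heap)
    runs, current, next_run = [], [], []
    for x in it:
        smallest = heapq.heappop(heap)
        current.append(smallest)
        if x >= smallest:
            heapq.heappush(heap, x)   # still fits in the current run
        else:
            next_run.append(x)        # must wait for the next run
        if not heap:                  # memory now holds only next-run items
            runs.append(current)
            current = []
            heap, next_run = next_run, []
            heapq.heapify(heap)
    # input exhausted: drain the current heap, then any next-run items
    while heap or next_run:
        while heap:
            current.append(heapq.heappop(heap))
        runs.append(current)
        current = []
        heap, next_run = next_run, []
        heapq.heapify(heap)
    return runs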
Re: [HACKERS] Using quicksort for every external sort run
Hi, On 03/22/2016 11:07 PM, Peter Geoghegan wrote: On Tue, Mar 22, 2016 at 2:27 PM, Tomas Vondra wrote: Each query was executed 5x for each work_mem value (between 8MB and 1GB), and then a median of the runs was computed and that's what's on the "comparison". This compares a414d96ad2b without (master) and with the patches applied (patched). The last set of columns is simply a "speedup" where "<1.0" means the patched code is faster, while >1.0 means it's slower. Values below 0.9 or 1.1 are using green or red background, to make the most significant improvements or regressions clearly visible. For the smaller data set (1M rows), things works pretty fine. There are pretty much no red cells (so no significant regressions), but quite a few green ones (with duration reduced by up to 50%). There are some results in the 1.0-1.05 range, but considering how short the queries are, I don't think this is a problem. Overall the total duration was reduced by ~20%, which is nice. For the 10M data sets, total speedup is also almost ~20%, and the speedups for most queries are also very nice (often ~50%). To be clear, you seem to mean that ~50% of the runtime of the query was removed. In other words, the quicksort version is twice as fast. Yes, that's what I meannt. Sorry for the inaccuracy. But the number of regressions is considerably higher - there's a small number of queries that got significantly slower for multiple data sets, particularly for smaller work_mem values. No time to fully consider these benchmarks right now, but: Did you make sure to set replacement_sort_mem very low so that it was never used when patched? And, was this on the latest version of the patch, where memory contexts were reset (i.e. the version that got committed recently)? You said something about memory batching, so ISTM that you should set that to '64', to make sure you don't get one longer run. That might mess with merging. 
I've tested the patch you've sent on 2016/3/11, which I believe is the last version. I haven't tuned the replacement_sort_mem at all, because my understanding was that it's not 9.6 material (per your message). So my intent was to test the configuration people are likely to use by default. I'm not sure about the batching - that was merely a guess of what might be the problem. Note that the master branch has the memory batching patch as of a few days back, so if that's the problem at the low end, then that's bad. I'm not sure which commit you are referring to. The benchmark was done on a414d96a (from 2016/3/10). However I'd expect that to affect both sets of measurements, although it's possible that it affects the patched version differently. But I don't think it is: I think that the regressions at the low end are about abbreviated keys, particularly the numeric cases. There is a huge gulf in the cost of those comparisons (abbreviated vs authoritative), and it is legitimately a weakness of the patch that it reduces the number in play. I think it's still well worth it, but it is a downside. There is no reason why the authoritative numeric comparator has to allocate memory, but right now that case isn't optimized. Yes, numeric and text are the most severely affected cases. I find it weird that the patch is exactly the same as master in a lot of cases. ISTM that with a case where you use 1GB of memory to sort 1 million rows, you're so close to an internal sort that it hardly matters (master will not need a merge step at all, most likely). The patch works best with sorts that take tens of seconds, and I don't think I see any here, nor any high memory tests where RS flops. Now, I think you focused on regressions because that was what was interesting, which is good. I just want to put that in context. I don't think the tests on 1M rows are particularly interesting, and I don't see any noticeable regressions there. Perhaps you mean the tests on 10M rows instead? 
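The abbreviated-vs-authoritative gulf being described can be sketched in miniature. This is a hedged illustration only, not PostgreSQL's actual SortSupport API: `SortDatum`, `abbreviate()`, and `datum_cmp()` are invented names. The point is that a cheap fixed-width prefix comparison resolves most pairs, and the expensive authoritative comparator (here just `strcmp()`) runs only on prefix ties:

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical abbreviated-key datum: an 8-byte big-endian prefix that
 * can be compared with two integer loads, plus the authoritative value
 * for resolving ties. */
typedef struct
{
    uint64_t    abbrev;     /* cheap-to-compare prefix */
    const char *full;       /* authoritative value */
} SortDatum;

static uint64_t
abbreviate(const char *s)
{
    uint64_t    v = 0;

    /* Pack up to 8 leading bytes, most significant first, so that
     * unsigned integer comparison matches strcmp() ordering. */
    for (int i = 0; i < 8 && s[i] != '\0'; i++)
        v |= (uint64_t) (unsigned char) s[i] << (56 - 8 * i);
    return v;
}

static int
datum_cmp(const SortDatum *a, const SortDatum *b)
{
    if (a->abbrev < b->abbrev)
        return -1;
    if (a->abbrev > b->abbrev)
        return 1;
    /* Abbreviated tie: pay for the authoritative comparison. */
    return strcmp(a->full, b->full);
}
```

When ties past the prefix are common, most of the comparison cost lands in the fallback path, which is where the gulf between the two comparators shows up.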
Yes, you're correct - I was mostly looking for regressions. However, the worst cases of regressions are on relatively long sorts, e.g. slowing down from 35 seconds to 64 seconds, etc. So that's quite long, and it's surely using a non-trivial amount of memory. Or am I missing something? regards -- Tomas Vondra http://www.2ndQuadrant.com PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Using quicksort for every external sort run
On Tue, Mar 22, 2016 at 2:27 PM, Tomas Vondra wrote: > Each query was executed 5x for each work_mem value (between 8MB and 1GB), > and then a median of the runs was computed and that's what's on the > "comparison". This compares a414d96ad2b without (master) and with the > patches applied (patched). The last set of columns is simply a "speedup" > where "<1.0" means the patched code is faster, while >1.0 means it's slower. > Values below 0.9 or above 1.1 use a green or red background, to make the most > significant improvements or regressions clearly visible. > > For the smaller data set (1M rows), things work pretty well. There are > pretty much no red cells (so no significant regressions), but quite a few > green ones (with duration reduced by up to 50%). There are some results in > the 1.0-1.05 range, but considering how short the queries are, I don't think > this is a problem. Overall the total duration was reduced by ~20%, which is > nice. > > For the 10M data sets, total speedup is also almost ~20%, and the speedups > for most queries are also very nice (often ~50%). To be clear, you seem to mean that ~50% of the runtime of the query was removed. In other words, the quicksort version is twice as fast. > But the number of > regressions is considerably higher - there's a small number of queries that > got significantly slower for multiple data sets, particularly for smaller > work_mem values. No time to fully consider these benchmarks right now, but: Did you make sure to set replacement_sort_mem very low so that it was never used when patched? And, was this on the latest version of the patch, where memory contexts were reset (i.e. the version that got committed recently)? You said something about memory batching, so ISTM that you should set that to '64', to make sure you don't get one longer run. That might mess with merging. Note that the master branch has the memory batching patch as of a few days back, so if that's the problem at the low end, then that's bad. 
But I don't think it is: I think that the regressions at the low end are about abbreviated keys, particularly the numeric cases. There is a huge gulf in the cost of those comparisons (abbreviated vs authoritative), and it is legitimately a weakness of the patch that it reduces the number in play. I think it's still well worth it, but it is a downside. There is no reason why the authoritative numeric comparator has to allocate memory, but right now that case isn't optimized. I find it weird that the patch is exactly the same as master in a lot of cases. ISTM that with a case where you use 1GB of memory to sort 1 million rows, you're so close to an internal sort that it hardly matters (master will not need a merge step at all, most likely). The patch works best with sorts that take tens of seconds, and I don't think I see any here, nor any high memory tests where RS flops. Now, I think you focused on regressions because that was what was interesting, which is good. I just want to put that in context. Thanks -- Peter Geoghegan
Re: [HACKERS] Using quicksort for every external sort run
On Thu, Mar 17, 2016 at 1:13 PM, Robert Haas wrote: > OK, I have now committed 0001 I attach a revision of the external quicksort patch and supplementary small patches, rebased on top of the master branch. Changes: 1. As I mentioned on the "Improve memory management for external sorts" -committers thread, we should protect against currentRun integer overflow. This new revision does so. I'm not sure if that change needs to be back-patched; I just don't want to take any risks, and see this as low-cost insurance. Really low workMem sorts are now actually fast enough that this seems like something that could happen on a misconfigured system. 2. Add explicit constants for special run numbers that replacement selection needs to care about in particular. I did this because change 1 reminded me of the "currentRun vs. SortTuple.tupindex" run numbering subtleties. The explicit use of certain run number constants seems to better convey some tricky details, in part by letting me add a few documenting (if obvious) assertions. It's educational to be able to grep for these constants (e.g., the new HEAP_RUN_NEXT constant) to jump to the parts of the code that need to think about replacement selection. As things were, that code relied on too much from too great a distance (arguably this is true even in the master branch). This change in turn led to minor wordsmithing to adjacent areas here and there, most of it subtractive. As an example of where this helps, ISTM that the assertion added to the routine tuplesort_heap_insert() is now self-documenting, which wasn't the case before. 3. There was one very tricky consideration around an edge-case that required careful thought. This was an issue within my new function dumpbatch(). It could previously perform what turns out to be a superfluous selectnewtape() call when we take the dumpbatch() "state->memtupcount == 0" early return path (see the last revision for full details of that now-axed code path). 
Now, we accept that there may on occasion be 0 tuple runs. In other words, we now never return early from within dumpbatch(). There was previously no explanation for why it was okay to have a superfluous selectnewtape() call. However, needing to be certain that any newly selected destTape tape will go on to receive a run is implied for the general case by this existing selectnewtape() code comment: * This is called after finishing a run when we know another run * must be started. This implements steps D3, D4 of Algorithm D. While the previous revision was correct anyway, I tried to explain why it was correct in comments, and soon realized that that was a really bad idea; the rationale was excessively complicated. Allowing 0 tuple runs in rare cases seems like the simplest solution. After all, mergeprereadone() is expressly prepared for 0 tuple runs. It says "ensure that we have at least one tuple, if any are to be had". There is no reason to assume that it says this only because it imagines that no tuples might be found *only after* the first preread for the merge (by which I mean I don't think that only applies when a final on-the-fly merge reloads tuples from one particular tape following running out of tuples of the tape/run in memory). 
+*/ FREEMEM(state, (state->maxTapes - activeTapes) * TAPE_BUFFER_OVERHEAD); It's not worth worrying about this case, since the savings are small (especially now that maxTapes is capped). But it's worth acknowledging that the "!state->tuples" case is being "short-changed", in the new spirit of heavily scrutinizing where memory goes in tuplesort.c. 5. I updated the "Add MemoryContextStats() calls for debugging" patch. I now formally propose that this debugging instrumentation be committed. This revised debugging instrumentation patch does not have the system report anything about the memory context just because "trace_sort = on". Rather, it does nothing on ordinary builds, where the macro SHOW_MEMORY_STATS will not be defined (it also respects trace_sort). This is about the same approach seen in postgres.c's finish_xact_command(). ISTM that we ought to provide a way of debugging memory use within tuplesort.c, since we now know that that could be very important. Let's not forget where the useful places to look for problems are. 6. Based on your feedback on t
Re: [HACKERS] Using quicksort for every external sort run
On Thu, Mar 10, 2016 at 9:54 PM, Peter Geoghegan wrote: > 1. Creates a separate memory context for tuplesort's copies of > caller's tuples, which can be reset at key points, avoiding > fragmentation. Every SortTuple.tuple is allocated there (with trivial > exception); *everything else*, including the memtuples array, is > allocated in the existing tuplesort context, which becomes the parent > of this new "caller's tuples" context. Roughly speaking, that means > that about half of total memory for the sort is managed by each > context in common cases. Even with a high work_mem memory budget, > memory fragmentation could previously get so bad that tuplesort would > in effect claim a share of memory from the OS that is *significantly* > higher than the work_mem budget allotted to its sort. And with low > work_mem settings, fragmentation previously made palloc() thrash the > sort, especially during non-final merging. In this latest revision, > tuplesort now almost gets to use 100% of the memory that was requested > from the OS by palloc() in cases tested. I spent some time looking at this part of the patch yesterday and today. This is not a full review yet, but here are some things I noticed: - I think that batchmemtuples() is somewhat weird. Normally, grow_memtuples() doubles the size of the array each time it's called. So if you somehow called this function when you still had lots of memory available, it would just double the size of the array. However, I think the expectation is that it's only going to be called when availMem is less than half of allowedMem, in which case we're going to get the special "last increment of memtupsize" behavior, where we expand the memtuples array by some multiple between 1.0 and 2.0 based on allowedMem/memNowUsed. And after staring at this for a long time ... yeah, I think this does the right thing. But it certainly is hairy. - It's not exactly clear what you mean when you say that the tuplesort context contains "everything else". 
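The "last increment of memtupsize" behaviour described above can be condensed into a sketch. Hedged: `final_grow_ratio()` is an invented name, and this is not batchmemtuples()'s actual arithmetic, only the shape of it as summarized in the review — a final growth multiplier derived from allowedMem/memNowUsed, clamped into [1.0, 2.0]:

```c
#include <assert.h>

/* Sketch of the final growth of the memtuples array: the multiplier is
 * allowedMem / memNowUsed, but it never shrinks the array and never
 * exceeds the ordinary doubling that grow_memtuples() would do. */
static double
final_grow_ratio(double allowedMem, double memNowUsed)
{
    double      ratio = allowedMem / memNowUsed;

    if (ratio < 1.0)
        ratio = 1.0;            /* never shrink */
    if (ratio > 2.0)
        ratio = 2.0;            /* ordinary doubling is the ceiling */
    return ratio;
}
```

So with plenty of memory left the clamp reproduces the normal doubling, while in the expected case (less than half the budget free) the array grows just enough to spend the remaining budget.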
I don't understand why that only ends up containing half the memory ... what, other than the memtuples array, ends up there? - If I understand correctly, the point of the MemoryContextReset call is: there wouldn't be any tuples in memory at that point anyway. But the OS-allocated chunks might be divided up into a bunch of small chunks that then got stuck on freelists, and those chunks might not be the right size for the next merge pass. Resetting the context avoids that problem by blowing up the freelists. Right? Clever. - I haven't yet figured out why we use batching only for the final on-the-fly merge pass, instead of doing it for all merges. I expect you have a reason. I just don't know what it is. - I have also not yet figured out why you chose to replace state->datumTypByVal with state->tuples and reverse the sense. I bet there's a reason for this, too. I don't know what it is, either. That's as far as I've gotten thus far. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
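The reset-vs-freelist point above can be illustrated with a toy bump-pointer region. This is an illustration only — `Region`, `region_alloc()`, and `region_reset()` are invented here, and PostgreSQL's aset.c allocator is far more sophisticated — but it shows the property MemoryContextReset() is being used for: a wholesale reset hands the next pass the entire budget regardless of how the previous pass carved it up, with no chunks stranded at the wrong size.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Toy bump-pointer region: allocation just advances a cursor within
 * one OS-allocated block, and reset rewinds the cursor, forgetting
 * every chunk boundary the previous pass created. */
typedef struct
{
    char       *base;
    size_t      size;
    size_t      used;
} Region;

static Region
region_create(size_t size)
{
    Region      r = {malloc(size), size, 0};

    return r;
}

static void *
region_alloc(Region *r, size_t n)
{
    if (r->used + n > r->size)
        return NULL;            /* budget exhausted */
    void       *p = r->base + r->used;

    r->used += n;
    return p;
}

static void
region_reset(Region *r)
{
    r->used = 0;                /* all prior chunk sizes forgotten */
}
```

With per-chunk free() into size-segregated freelists, a pass that allocated many odd-sized chunks can leave nothing reusable for a pass that wants different sizes; after a reset, any mix of sizes fits again.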
Re: [HACKERS] Using quicksort for every external sort run
On Wed, Mar 16, 2016 at 9:42 PM, Peter Geoghegan wrote: > On Wed, Mar 16, 2016 at 3:31 PM, Robert Haas wrote: >> I spent some time looking at this part of the patch yesterday and >> today. > > Thanks for getting back to it. OK, I have now committed 0001, and separately, some comment improvements - or at least, I think they are improvements - based on this discussion. Thanks. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Re: [HACKERS] Using quicksort for every external sort run
On Wed, Mar 16, 2016 at 3:31 PM, Robert Haas wrote: > I spent some time looking at this part of the patch yesterday and > today. Thanks for getting back to it. > - I think that batchmemtuples() is somewhat weird. Normally, > grow_memtuples() doubles the size of the array each time it's called. > So if you somehow called this function when you still had lots of > memory available, it would just double the size of the array. > However, I think the expectation is that it's only going to be called > when availMem is less than half of allowedMem, in which case we're > going to get the special "last increment of memtupsize" behavior, > where we expand the memtuples array by some multiple between 1.0 and > 2.0 based on allowedMem/memNowUsed. That's right. It might be possible for the simple doubling behavior to happen under artificial conditions instead, for example when we have enormous individual tuples, but if that does happen it's still correct. I just didn't think it was worth worrying about giving back more memory in such extreme edge-cases. > And after staring at this for a > long time ... yeah, I think this does the right thing. But it > certainly is hairy. No arguments from me here. I think this is justified, though. It's great that palloc() provides a simple, robust abstraction. However, there are a small number of modules in the code, including tuplesort.c, where we need to be very careful about memory management. Probably no more than 5 and no less than 3. In these places, large memory allocations are the norm. We ought to pay close attention to memory locality, heap fragmentation, that memory is well balanced among competing considerations, etc. It's entirely appropriate that we'd go to significant lengths to get it right in these places using somewhat ad-hoc techniques, simply because these are the places where we'll get a commensurate benefit. 
Some people might call this adding custom memory allocators, but I find that to be a loaded term because it suggests intimate involvement from mcxt.c. > - It's not exactly clear what you mean when you say that the tuplesort > context contains "everything else". I don't understand why that only > ends up containing half the memory ... what, other than the memtuples > array, ends up there? I meant that the existing memory context "sortcontext" contains everything else that has anything to do with sorting. Everything that it contains in the master branch it continues to contain today, with the sole exception of a vast majority of caller's tuples. So, "sortcontext" continues to include everything you can think of:

* As you pointed out, the memtuples array.

* SortSupport state (assuming idiomatic usage of the API, at least).

* State specific to the cluster case.

* Transient state, specific to the index case (i.e. scankey memory).

* logtape.c stuff.

* Dynamically allocated stuff for managing tapes (see inittapes()).

* For the sake of simplicity, a tiny number of remaining tuples (from "overflow" allocations, or from when we need to free a tape's entire batch when it is one tuple from exhaustion). This is for tuples that the tuplesort caller needs to pfree() anyway, per the tuplesort_get*tuple() API. It's just easier to put these allocations in the "main" context, to avoid having to reason about any consequences to calling MemoryContextReset() against our new tuple context. This precaution is just future-proofing IMV.

I believe that this list is exhaustive. > - If I understand correctly, the point of the MemoryContextReset call > is: there wouldn't be any tuples in memory at that point anyway. But > the OS-allocated chunks might be divided up into a bunch of small > chunks that then got stuck on freelists, and those chunks might not be > the right size for the next merge pass. Resetting the context avoids > that problem by blowing up the freelists. Right? 
Your summary of the practical benefit is accurate. While I've emphasized regressions at the low-end with this latest revision, it's also true that resetting helps in memory rich environments, when we switch from retail palloc() calls to the final merge step's batch allocation, which palloc() seemed to do very badly with. It makes sense that this abrupt change in the pattern of allocations could cause significant heap memory fragmentation. > Clever. Thanks. Introducing a separate memory context that is strictly used for caller tuples makes it clear and obvious that it's okay to call MemoryContextReset() when state->memtupcount == 0. It's not okay to put anything in the new context that could break the calls to MemoryContextReset(). You might not have noticed that a second MemoryContextReset() call appears in the quicksort patch, which helps a bit too. I couldn't easily make that work with the replacement selection heap, because master's tuplesort.c never fully empties its RS heap until the last run. I can only perform the first call to MemoryContextReset() in the memory patch because it
Re: [HACKERS] Using quicksort for every external sort run
On Wed, Mar 16, 2016 at 6:42 PM, Peter Geoghegan wrote: >> - I think that batchmemtuples() is somewhat weird. Normally, >> grow_memtuples() doubles the size of the array each time it's called. >> So if you somehow called this function when you still had lots of >> memory available, it would just double the size of the array. >> However, I think the expectation is that it's only going to be called >> when availMem is less than half of allowedMem, in which case we're >> going to get the special "last increment of memtupsize" behavior, >> where we expand the memtuples array by some multiple between 1.0 and >> 2.0 based on allowedMem/memNowUsed. > > That's right. It might be possible for the simple doubling behavior to > happen under artificial conditions instead, for example when we have > enormous individual tuples, but if that does happen it's still > correct. I just didn't think it was worth worrying about giving back > more memory in such extreme edge-cases. Come to think of it, maybe the pass-by-value datum sort case should also call batchmemtuples() (or something similar). If you look at how beginmerge() is called, you'll see that that doesn't happen. Obviously this case is not entitled to a "memtupsize * STANDARDCHUNKHEADERSIZE" refund, since of course there never was any overhead like that at any point. And, obviously this case has no need for batch memory at all. However, it is entitled to get a refund for non-used tapes (accounted for, but, it turns out, never allocated tapes). It should then get the benefit of that refund by way of growing memtuples through a similar "final, honestly, I really mean it this time" call to grow_memtuples(). So, while the "memtupsize * STANDARDCHUNKHEADERSIZE refund" part should still be batch-specific (i.e. used for the complement of tuplesort cases, never the datum pass-by-val case), the new grow_memtuples() thing should always happen with external sorts. 
The more I think about it, the more I wonder if we should commit something like the debugging patch 0004-* (enabled only when trace_sort = on, of course). Close scrutiny of what tuplesort.c is doing with memory is important. -- Peter Geoghegan
Re: [HACKERS] Using quicksort for every external sort run
On Wed, Mar 16, 2016 at 9:42 PM, Peter Geoghegan wrote: >> - I haven't yet figured out why we use batching only for the final >> on-the-fly merge pass, instead of doing it for all merges. I expect >> you have a reason. I just don't know what it is. > > The most obvious reason, and possibly the only reason, is that I have > license to lock down memory accounting in the final on-the-fly merge > phase. Almost equi-sized runs are the norm, and code like this is no > longer obligated to work: > > FREEMEM(state, GetMemoryChunkSpace(stup->tuple)); > > That's why I explicitly give up on "conventional accounting". USEMEM() > and FREEMEM() calls become unnecessary for this case that is well > locked down. Oh, and I know that I won't use most tapes, so I can give > myself a FREEMEM() refund before doing the new grow_memtuples() thing. > > I want to make batch memory usable for runs, too. I haven't done that > either for similar reasons. FWIW, I see no great reason to worry about > non-final merges. Fair enough. My concern was mostly whether the code would become simpler if we always did this when merging, instead of only on the final merge. But the final merge seems to be special in quite a few respects, so maybe not. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Re: [HACKERS] Using quicksort for every external sort run
On Thu, Mar 17, 2016 at 1:13 PM, Robert Haas wrote: > OK, I have now committed 0001, and separately, some comment > improvements - or at least, I think they are improvements - based on > this discussion. Thanks! Your changes look good to me. It's always interesting to learn what wasn't so obvious to you when you review my patches. It's probably impossible to stare at something like tuplesort.c for as long as I have and get that balance just right. -- Peter Geoghegan
Re: [HACKERS] Using quicksort for every external sort run
On Sun, Feb 14, 2016 at 8:01 PM, Peter Geoghegan wrote: > The query I'm testing is: "reindex index pgbench_accounts_pkey;" > > Now, with a maintenance_work_mem of 5MB, the most recent revision of > my patch takes about 54.2 seconds to complete this, as compared to > master's 44.4 seconds. So, clearly a noticeable regression there of > just under 20%. I did not see a regression with a 5MB > maintenance_work_mem when pgbench scale was 100, though. I've fixed this regression, and possibly all regressions where workMem > 4MB. I've done so without resorting to making the heap structure more complicated, or using a heap more often than when replacement_sort_mem is exceeded by work_mem or maintenance_work_mem (so replacement_sort_mem becomes something a bit different to what we discussed, Robert -- more on that later). This seems like an "everybody wins" situation, because in this revision the patch series is now appreciably *faster* where the amount of memory available is only a tiny fraction of the total input size. Jeff Janes deserves a lot of credit for helping me to figure out how to do this. I couldn't get over his complaint about the regression he saw a few months back. He spoke of an "anti-sweetspot" in polyphase merge, and how he suspected that to be the real culprit (after all, most of his time was spent merging, with or without the patch applied). He also said that reverting the memory batch/pool patch made things go a bit faster, somewhat ameliorating his regression (when just the quicksort patch was applied). This made no sense to me, since I understood the memory batching patch to be orthogonal to the quicksort thing, capable of being applied independently, and more or less a strict improvement on master, no matter what the variables of the sort are. 
Jeff's regressed case especially made no sense to me (and, I gather, to him) given that the regression involved no correlation, and so clearly wasn't reliant on generating far fewer/longer runs than the patch (that's the issue we've discussed more than any other now -- it's a red herring, it seems). As I suspected out loud on February 14th, replacement selection mostly just *masked* the real problem: the problem of palloc() fragmentation. There doesn't seem to be much of an issue with the scheduling of polyphase merging, once you fix palloc() fragmentation. I've created a new revision, incorporating this new insight. New Revision Attached revision of patch series: 1. Creates a separate memory context for tuplesort's copies of caller's tuples, which can be reset at key points, avoiding fragmentation. Every SortTuple.tuple is allocated there (with trivial exception); *everything else*, including the memtuples array, is allocated in the existing tuplesort context, which becomes the parent of this new "caller's tuples" context. Roughly speaking, that means that about half of total memory for the sort is managed by each context in common cases. Even with a high work_mem memory budget, memory fragmentation could previously get so bad that tuplesort would in effect claim a share of memory from the OS that is *significantly* higher than the work_mem budget allotted to its sort. And with low work_mem settings, fragmentation previously made palloc() thrash the sort, especially during non-final merging. In this latest revision, tuplesort now almost gets to use 100% of the memory that was requested from the OS by palloc() in cases tested. 2. Loses the "quicksort with spillover" case entirely, making the quicksort patch significantly simpler. A *lot* of code was thrown out. This change is especially significant because it allowed me to remove the cost model that Robert took issue with so vocally. 
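The fewer/longer-runs question that keeps coming up in this thread can be reproduced with a toy simulation. Hedged: this is a self-contained model, not tuplesort.c code — `simulate_rs_runs()` is invented, and a linear scan stands in for the heap, since only the run lengths matter here. On random input, replacement selection produces runs averaging about twice the memory budget (Knuth's classic result), while on already-sorted input it produces a single run:

```c
#include <assert.h>
#include <stdlib.h>

/* Model replacement selection run formation with `cap` tuples of
 * memory: repeatedly emit the smallest current-run member; an incoming
 * tuple joins the current run if it sorts after the tuple just
 * emitted, otherwise it is held back for the next run. */
static int
simulate_rs_runs(const int *input, int n, int cap)
{
    int        *cur = malloc(cap * sizeof(int));
    int        *nxt = malloc(cap * sizeof(int));
    int         ncur = 0, nnxt = 0, pos = 0, runs = 0;

    while (ncur < cap && pos < n)       /* initial fill */
        cur[ncur++] = input[pos++];

    while (ncur > 0)
    {
        /* select and remove the minimum of the current run's members */
        int         mi = 0;

        for (int i = 1; i < ncur; i++)
            if (cur[i] < cur[mi])
                mi = i;
        int         minval = cur[mi];

        cur[mi] = cur[--ncur];

        if (pos < n)
        {
            int         next = input[pos++];

            if (next >= minval)
                cur[ncur++] = next;     /* still fits in this run */
            else
                nxt[nnxt++] = next;     /* must wait for the next run */
        }
        if (ncur == 0)                  /* run ends; start the next */
        {
            runs++;
            for (int i = 0; i < nnxt; i++)
                cur[i] = nxt[i];
            ncur = nnxt;
            nnxt = 0;
        }
    }
    free(cur);
    free(nxt);
    return runs;
}
```

Running this on random input yields roughly n/(2*cap) runs, and on sorted input exactly one run — the correlated case where replacement selection avoids merging entirely.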
"Quicksort with spillover" was always far less important than the basic idea of using quicksort for external sorts, so I'm not sad to see it go. And, I thought that the cost model was pretty bad myself. 3. Fixes cost_sort(), making optimizer account for the fact that runs are now about sort_mem-sized, not (sort_mem * 2)-sized. While I was at it, I made cost_sort() more optimistic about the amount of random I/O required relative to sequential I/O. This additional change to cost_sort() was probably overdue. 4. Restores the ability of replacement selection to generate one run and avoid any merging (previously, only one really long run and one short run was possible, because at the time I conceptualized replacement selection as being all about enabling "quicksort with spillover", which quicksorted that second run in memory). This only-one-run case is the case that Robert particularly cared about, and it's fully restored when RS is in use (which can still only happen for the first run, just never for the benefit of the now-axed "quicksort with spillover" case). 5. Adds a new GUC, "replacement_sort_mem". The default setting is 16MB. Docs are included. If work_mem/maintenance_work_mem is less than or equal to this, the first (and hopefully only) run uses r
Re: [HACKERS] Using quicksort for every external sort run
On Thu, Mar 10, 2016 at 10:39 AM, Greg Stark wrote: > I want to rerun these on a dedicated machine and with trace_sort > enabled so that we can see how many merge passes were actually > happening and how much I/O was actually happening. Putting the results in context by keeping trace_sort output with the results is definitely a good idea here. Otherwise, it's almost impossible to determine what happened after the fact. I have had "trace_sort = on" in my dev postgresql.conf for some time now. :-) When I produce my next revision, we should focus on regressions at the low end, like the 4MB work_mem for multiple GB table size cases you show here. So, I ask that any benchmarks that you or Tomas do look at that first and foremost. It's clear that in high memory environments the patch significantly improves performance, often by as much as 2.5x, and so that isn't really a concern anymore. I think we may be able to comprehensively address Robert's concerns about regressions with very little work_mem and lots of data by fixing a problem with polyphase merge. More to come soon. -- Peter Geoghegan
Re: [HACKERS] Using quicksort for every external sort run
On Thu, Mar 10, 2016 at 1:40 PM, Tomas Vondra wrote: > I was thinking about running some benchmarks on this patch, but the > thread is pretty huge so I want to make sure I'm not missing something > and this is indeed the most recent version. I also ran some preliminary benchmarks just before FOSDEM and intend to get back to it after running different benchmarks. These are preliminary because it was only a single run and on a machine that wasn't dedicated for benchmarks. These were comparing the quicksort-all-runs patch against HEAD at the time without the memory management optimizations which I think are independent of the sort algorithm. It looks to me like the interesting space to test is on fairly small work_mem compared to the data size. There's a general slowdown on 4MB-8MB work_mem when the data set is more than a gigabyte but even in the worst case it's only a 30% slowdown and the speedup in the more realistic scenarios looks at least as big. I want to rerun these on a dedicated machine and with trace_sort enabled so that we can see how many merge passes were actually happening and how much I/O was actually happening. -- greg
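The merge-pass count that trace_sort would confirm can be ballparked ahead of time. This is a rough estimate only — `estimate_merge_passes()` is invented here, it assumes equal-sized quicksorted runs and a balanced k-way merge, and it ignores polyphase scheduling entirely:

```c
#include <assert.h>

/* Rough merge-pass estimate: initial run count is about
 * input / work_mem for quicksorted runs, and each balanced
 * merge_order-way pass divides the number of runs by merge_order. */
static int
estimate_merge_passes(long input_kb, long work_mem_kb, int merge_order)
{
    long        runs = (input_kb + work_mem_kb - 1) / work_mem_kb;
    int         passes = 0;

    while (runs > 1)
    {
        runs = (runs + merge_order - 1) / merge_order;  /* one pass */
        passes++;
    }
    return passes;
}
```

By this crude estimate, a 4GB input at 4MB work_mem with a 7-way merge needs four passes, which is exactly the small-work_mem, multi-gigabyte territory where the slowdowns above were observed.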
Re: [HACKERS] Using quicksort for every external sort run
On Thu, Mar 10, 2016 at 5:40 AM, Tomas Vondra wrote: > I was thinking about running some benchmarks on this patch, but the > thread is pretty huge so I want to make sure I'm not missing something > and this is indeed the most recent version. Wait 24 - 48 hours, please. Big update coming. -- Peter Geoghegan
Re: [HACKERS] Using quicksort for every external sort run
Hi, On Mon, 2015-12-28 at 15:03 -0800, Peter Geoghegan wrote: > On Fri, Dec 18, 2015 at 11:57 AM, Peter Geoghegan wrote: > > BTW, I'm not necessarily determined to make the new special-purpose > > allocator work exactly as proposed. It seemed useful to prioritize > > simplicity, and currently so there is one big "huge palloc()" with > > which we blow our memory budget, and that's it. However, I could > > probably be more clever about "freeing ranges" initially preserved for > > a now-exhausted tape. That kind of thing. > > Attached is a revision that significantly overhauls the memory patch, > with several smaller changes. I was thinking about running some benchmarks on this patch, but the thread is pretty huge so I want to make sure I'm not missing something and this is indeed the most recent version. Is that the case? regards -- Tomas Vondra http://www.2ndQuadrant.com PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Re: [HACKERS] Using quicksort for every external sort run
On Mon, Feb 15, 2016 at 3:45 PM, Greg Stark wrote: > I was thinking about this over the past couple weeks. I'm starting to > think the quicksort runs gives at least the beginnings of a way > forward on this front. As I've already pointed out several times, I wrote a tool that makes it easy to load sortbenchmark.org data into a PostgreSQL table: https://github.com/petergeoghegan/gensort (You should use the Python script that invokes the "gensort" utility -- see its "--help" display for details). This seems useful as a standard benchmark, since it's perfectly deterministic, allowing the user to create arbitrarily large tables to use for sort benchmarks. Still, it doesn't produce data that is in any way organic; the sort data is uniformly distributed. Also, it produces a table that really only has one attribute to sort on, a text attribute. I suggest looking at real-world data, too. I have downloaded UK land registry data, which is a freely available dataset about property sales in the UK since the 1990s, of which there have apparently been about 20 million (I started with a 20 million line CSV file). I've used COPY to load the data into one PostgreSQL table. I attach instructions on how to recreate this, and some suggested CREATE INDEX statements that seemed representative to me. There are a variety of Postgres data types in use, including UUID, numeric, and text. The final Postgres table is just under 3GB. I will privately make available a URL that those CC'd here can use to download a custom format dump of the table, which comes in at 1.1GB (ask me off-list if you'd like to get that URL, but weren't CC'd here). This URL is provided as a convenience for reviewers, who can skip my detailed instructions. An expensive rollup() query on the "land_registry_price_paid_uk" table is interesting. 
Example:

select date_trunc('year', transfer_date), county, district, city, sum(price)
from land_registry_price_paid_uk
group by rollup (1, county, district, city);

Performance is within ~5% of an *internal* sort with the patch series applied, even though ~80% of time is spent copying and sorting SortTuples overall in the internal sort case (the internal case cannot overlap sorting and aggregate processing, since it has no final merge step). This is a nice demonstration of how this work has significantly blurred the line between internal and external sorts. -- Peter Geoghegan

Instructions

CSV File
---

The land registry file from http://data.gov.uk is 3.2GB. It is a CSV file with organic data that can be loaded into PostgreSQL. No registration required. See https://theodi.org/blog/the-status-of-csvs-on-datagovuk for details on downloading the file pp-complete.csv.

SQL
---

begin;

create table land_registry_price_paid_uk(
  transaction uuid,
  price numeric,
  transfer_date date,
  postcode text,
  property_type char(1),
  newly_built boolean,
  duration char(1),
  paon text,
  saon text,
  street text,
  locality text,
  city text,
  district text,
  county text,
  ppd_category_type char(1));

copy land_registry_price_paid_uk FROM '/home/pg/Downloads/pp-complete.csv' with (format csv, freeze true, encoding 'win1252', header false, null '', quote '"', force_null (postcode, saon, paon, street, locality, city, district));

commit;

Resulting table
---

postgres=# \dt+
                              List of relations
 Schema │            Name             │ Type  │ Owner │  Size   │ Description
────────┼─────────────────────────────┼───────┼───────┼─────────┼─────────────
 public │ land_registry_price_paid_uk │ table │ pg    │ 2779 MB │
(1 row)

Interesting Indexes
===

Many attribute index (low cardinality leading attributes):

postgres=# create index on land_registry_price_paid_uk (county, district, city, locality, street);

UUID pk index (UUID type, high cardinality):

postgres=# create index on land_registry_price_paid_uk (transaction);

Price index (numeric, moderate cardinality):

postgres=# create index on land_registry_price_paid_uk (price);

Preview
===

pg@hamster:~$ head ~/Downloads/pp-complete.csv
"{0C7ADEF5-878D-4066-B785-003ED74A}","163000","2003-02-21 00:00","UB5 4PJ","T","N","F","106","","READING ROAD","NORTHOLT","NORTHOLT","EALING","GREATER LONDON","A"
"{35F67271-ABD4-40DA-AB09-0085B9D3}","247500","2005-07-15 00:00","TA19 9DD","D","N","F","58","","ADAMS MEADOW","ILMINSTER","ILMINSTER","SOUTH SOMERSET","SOMERSET","A"
"{B20B1C74-E8E1-4137-AB3E-011DF342}","32","2010-09-10 00:00","W4 1DZ","F","N","L","58","","WHELLOCK ROAD","","LONDON","EALING","GREATER LONDON","A"
"{7D6B0915-C56B-4275-AF9B-0156BCE7}","104000","1997-08-27 00:00","NE61 2BH","D","N","F","17","","WESTGATE","MORPETH","MORPETH","CASTLE MORPETH","NORTHUMBERLAND","A"
"{47B60101-B64C-413D-8F60-0
Re: [HACKERS] Using quicksort for every external sort run
On Mon, Feb 15, 2016 at 8:43 PM, Jim Nasby wrote: > On 2/7/16 8:57 PM, Peter Geoghegan wrote: >> >> It seems less than ideal that DBAs have to be so >> conservative in sizing work_mem. > > > +10 I was thinking about this over the past couple weeks. I'm starting to think the quicksort runs gives at least the beginnings of a way forward on this front. Keeping in mind that we know how many tapes we can buffer in memory, and that the number is likely to be relatively large -- on the order of 100+ is typical -- what if we do something like the following rough sketch: Give users two knobs, a lower bound "sort in memory using quicksort" memory size and an upper bound "absolutely never use more than this" which they can set to a substantial fraction of physical memory. Then when we overflow the lower bound we start generating runs, the first one being of that length. Each run we generate we double (or increase by 50% or something) until we hit the maximum. That means that the first few runs may be shorter than necessary, but we have enough tapes available that that doesn't hurt much, and we'll eventually get to a large enough run size that we won't run out of tapes and can still do a single final (on the fly) merge. In fact what's really attractive about this idea is that it might give us a reasonable spot to do some global system resource management. Each time we want to increase the run size we check some shared memory counter of how much memory is in use and refuse to increase if there's too much in use (or if we're using too large a fraction of it or some other heuristic). The key point is that since we don't need to decide up front at the beginning of the sort and we don't need to track it continuously, there is neither too little nor too much contention on this shared memory variable. Also the behaviour would be not too chaotic if there's a user-tunable minimum and the other activity in the system only controls how much more memory it can steal from the global pool on top of that. 
-- greg
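[Editor's note: the doubling scheme Greg sketches above is easy to model. The following is a minimal, hypothetical sketch (invented names, run sizes counted abstractly; this is not tuplesort code) of the run-size schedule: each run starts at the lower bound and grows geometrically until it hits the hard cap.]

```python
def run_size_schedule(low_bound, high_bound, total_tuples):
    """Yield the planned size of each sorted run, doubling until capped."""
    remaining = total_tuples
    size = low_bound
    while remaining > 0:
        this_run = min(size, remaining)
        yield this_run
        remaining -= this_run
        # Double the next run's budget, but never exceed the hard cap.
        # (A global resource check could veto the increase here.)
        size = min(size * 2, high_bound)

# With a small lower bound and a cap of 64 "units", the first few runs are
# short, but growth is geometric, so the total run count stays small.
sizes = list(run_size_schedule(low_bound=4, high_bound=64, total_tuples=500))
```

Because growth is geometric, the number of runs is logarithmic in the ratio of cap to lower bound plus linear in the data beyond the cap, which is why the short early runs cost little given 100+ buffered tapes.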
Re: [HACKERS] Using quicksort for every external sort run
On 2/7/16 8:57 PM, Peter Geoghegan wrote: It seems less than ideal that DBAs have to be so conservative in sizing work_mem. +10 -- Jim Nasby, Data Architect, Blue Treble Consulting, Austin TX Experts in Analytics, Data Architecture and PostgreSQL Data in Trouble? Get it in Treble! http://BlueTreble.com
Re: [HACKERS] Using quicksort for every external sort run
On Sun, Feb 7, 2016 at 4:50 PM, Peter Geoghegan wrote: >> I'm not even sure this is necessary. The idea of missing out on >> producing a single sorted run sounds bad but in practice since we >> normally do the final merge on the fly there doesn't seem like there's >> really any difference between reading one tape or reading two or three >> tapes when outputing the final results. There will be the same amount >> of I/O happening and a 2-way or 3-way merge for most data types should >> be basically free. > > I basically agree with you, but it seems possible to fix the > regression (generally misguided though those regressed cases are). > It's probably easiest to just fix it. Here is a benchmark on my laptop: $ pgbench -i -s 500 --unlogged This results in a ~1GB accounts PK:

postgres=# \di+ pgbench_accounts_pkey
List of relations
─[ RECORD 1 ]──────────────────────
Schema      │ public
Name        │ pgbench_accounts_pkey
Type        │ index
Owner       │ pg
Table       │ pgbench_accounts
Size        │ 1071 MB
Description │

The query I'm testing is: "reindex index pgbench_accounts_pkey;" Now, with a maintenance_work_mem of 5MB, the most recent revision of my patch takes about 54.2 seconds to complete this, as compared to master's 44.4 seconds. So, clearly a noticeable regression there of about 20%. I did not see a regression with a 5MB maintenance_work_mem when pgbench scale was 100, though. And, with the default maintenance_work_mem of 64MB, it's a totally different story -- my patch takes about 28.3 seconds, whereas master takes 48.5 seconds (i.e. longer than with 5MB). My patch needs a 56-way final merge in the 64MB maintenance_work_mem case, and 47 distinct merge steps, plus a final on-the-fly merge, in the 5MB maintenance_work_mem case. So, a huge amount of merging, but RS still hardly pays for itself. With the regressed case for my patch, we finish sorting *runs* about 15 seconds into a 54.2 second operation -- very early. 
So it isn't "quicksort vs replacement selection", so much as "polyphase merge vs replacement selection". There is a good reason to think that we can make progress on fixing that regression by doubling down on the general strategy of improving cache characteristics, and being cleverer about memory use during non-final merging, too. I looked at what it would take to make the heap a smaller part of memtuples, along the lines Robert and I talked about, and I think it's non-trivial because it needs to make the top of the heap something other than memtuples[0]. I'd need to change the heap code, which already has 3 reasons for existing (RS, merging, and top-N heap). I'll find it really hard to justify the effort, and especially the risk of adding bugs, for a benefit that there is *scant* evidence for. My guess is that the easiest, and most sensible way to fix the ~20% regression seen here is to introduce batch memory allocation to non-final merge steps, which is where most time was spent. (For simplicity, that currently only happens during the final merge phase, but I could revisit that -- seems not that hard). Now, I accept that the cost model has to go. So, what I think would be best is if we still added a GUC, like the replacement_sort_mem suggestion that Robert made. This would be a threshold for using what is currently called "quicksort with spillover". There'd be no cost model. Jeff Janes also suggested something like this. The only regression that I find concerning is the one reported by Jeff Janes [1]. That didn't even involve a correlation, though, so no reason to think that it would be at all helped by what Robert and I talked about. It seemed like the patch happened to have the effect of tickling a pre-existing problem with polyphase merge -- what Jeff called an "anti-sweetspot". Jeff had a plausible theory for why that is. So, what if we try to fix polyphase merge? That would be easier. 
We could look at the tape buffer size, and the number of tapes, as well as memory access patterns. We might even make more fundamental changes to polyphase merge, since we don't use the more advanced variants that Knuth describes; I think correlation is a red herring. Knuth suggests that his algorithm 5.4.3, cascade merge, has more efficient distribution of runs. The bottom line is that there will always be some regression somewhere. I'm not sure what the guiding principle for when that becomes unacceptable is, but you did seem sympathetic to the idea that really low work_mem settings (e.g. 4MB) with really big inputs were not too concerning [2]. I'm emphasizing Jeff's case now because I, like you [2], am much more worried about maintenance_work_mem default cases with regressions than anything else, and that *was* such a case. Like Jeff Janes, I don't care about his other regression of about 5% [3], which involved a 4MB work_mem + 100 million tuples. The other case (the one I do care about) was 64MB + 400 million tuples, and was a much bigger regression, which is suggestive of the unpredictable
Re: [HACKERS] Using quicksort for every external sort run
On Sun, Feb 7, 2016 at 4:50 PM, Peter Geoghegan wrote: >> I'm not even sure this is necessary. The idea of missing out on >> producing a single sorted run sounds bad but in practice since we >> normally do the final merge on the fly there doesn't seem like there's >> really any difference between reading one tape or reading two or three >> tapes when outputing the final results. There will be the same amount >> of I/O happening and a 2-way or 3-way merge for most data types should >> be basically free. > > I basically agree with you, but it seems possible to fix the > regression (generally misguided though those regressed cases are). > It's probably easiest to just fix it. On a related note, we should probably come up with a way of totally supplanting the work_mem model with something smarter in the next couple of years. Something that treats memory as a shared resource even when it's allocated privately, per-process. This external sort stuff really smooths out the cost function of sorts. ISTM that that makes the idea of dynamic memory budgets (in place of a one size fits all work_mem) seem desirable for the first time. That said, I really don't have a good sense of how to go about moving in that direction at this point. It seems less than ideal that DBAs have to be so conservative in sizing work_mem. -- Peter Geoghegan
Re: [HACKERS] Using quicksort for every external sort run
On Sun, Feb 7, 2016 at 10:51 AM, Greg Stark wrote: >> > You don't want to change the behavior of the current patch for the >> > second or subsequent run; that should remain a quicksort, pure and >> > simple. Do I have that right? >> >> Yes. > I'm not even sure this is necessary. The idea of missing out on > producing a single sorted run sounds bad but in practice since we > normally do the final merge on the fly there doesn't seem like there's > really any difference between reading one tape or reading two or three > tapes when outputing the final results. There will be the same amount > of I/O happening and a 2-way or 3-way merge for most data types should > be basically free. I basically agree with you, but it seems possible to fix the regression (generally misguided though those regressed cases are). It's probably easiest to just fix it. -- Peter Geoghegan
Re: [HACKERS] Using quicksort for every external sort run
On Sun, Feb 7, 2016 at 8:21 PM, Robert Haas wrote: > On Sun, Feb 7, 2016 at 11:00 AM, Peter Geoghegan wrote: > > It was my understanding, based on your emphasis on producing only a > > single run, as well as your recent remarks on this thread about the > > first run being special, that you are really only interested in the > > presorted case, where one run is produced. That is, you are basically > > not interested in preserving the general ability of replacement > > selection to double run size in the event of a uniform distribution. >... > > You don't want to change the behavior of the current patch for the > > second or subsequent run; that should remain a quicksort, pure and > > simple. Do I have that right? > > Yes. I'm not even sure this is necessary. The idea of missing out on producing a single sorted run sounds bad, but in practice, since we normally do the final merge on the fly, there doesn't seem like there's really any difference between reading one tape or reading two or three tapes when outputting the final results. There will be the same amount of I/O happening, and a 2-way or 3-way merge for most data types should be basically free. On Sun, Feb 7, 2016 at 8:21 PM, Robert Haas wrote: > 3. At this point, we have one sorted tape per worker. Perform a final > merge pass to get the final result. I don't even think you have to merge until you get one tape per worker. You can statically decide how many tapes you can buffer in memory based on work_mem and merge until you get N/workers tapes so that a single merge in the gather node suffices. I would expect that to nearly always mean the workers are only responsible for generating the initial sorted runs and the single merge pass is done in the gather node on the fly as the tuples are read. -- greg
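[Editor's note: the division of labor discussed above -- workers each produce one sorted tape, and a single final merge happens in the gather node -- can be modeled with a toy sketch. Everything below is invented for illustration (in-memory lists stand in for tapes; no real worker processes); it is not the actual parallel tuplesort design.]

```python
import heapq

def worker_sort(partition):
    # Each "worker" sorts its own input, leaving one sorted "tape".
    return sorted(partition)

def parallel_sort(data, nworkers):
    # Partition the input across workers. Uneven partitioning skews the
    # work onto some workers, which is the disadvantage noted in the thread.
    partitions = [data[i::nworkers] for i in range(nworkers)]
    tapes = [worker_sort(p) for p in partitions]
    # One final on-the-fly merge across the per-worker tapes, as would
    # happen in the gather node while tuples are read.
    return list(heapq.merge(*tapes))

result = parallel_sort([5, 3, 8, 1, 9, 2, 7, 4, 6, 0], nworkers=3)
```

The point of Greg's refinement is visible here: as long as the number of tapes the leader can buffer exceeds the number of workers, no intermediate merge passes are needed at all.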
Re: [HACKERS] Using quicksort for every external sort run
On Sun, Feb 7, 2016 at 11:00 AM, Peter Geoghegan wrote: > Right. Let me give you the executive summary first: I continue to > believe, following thinking about the matter in detail, that this is a > sensible compromise, that weighs everyone's concerns. It is pretty > close to a win-win. I just need you to confirm what I say here in > turn, so we're sure that we understand each other perfectly. Makes sense to me. >> The basic idea is that we will add a new GUC with a name like >> replacement_sort_mem that will have a default value in the range of >> 20-30MB; or possibly we will hardcode this value, but for purposes of >> this email I'm going to assume it's a GUC. If the value of work_mem >> or maintenance_work_mem, whichever applies, is smaller than the value >> of replacement_sort_mem, then the latter has no effect. > > By "no effect", you must mean that we always use a heap for the entire > first run (albeit for the tail, with a hybrid quicksort/heap > approach), but still use quicksort for every subsequent run, when it's > clearly established that we aren't going to get one huge run. Is that > correct? Yes. > It was my understanding, based on your emphasis on producing only a > single run, as well as your recent remarks on this thread about the > first run being special, that you are really only interested in the > presorted case, where one run is produced. That is, you are basically > not interested in preserving the general ability of replacement > selection to double run size in the event of a uniform distribution. > (That particular doubling property of replacement selection is now > technically lost by virtue of using this new hybrid model *anyway*, > although it will still make runs longer in general). > > You don't want to change the behavior of the current patch for the > second or subsequent run; that should remain a quicksort, pure and > simple. Do I have that right? Yes. 
> BTW, parallel sort should probably never use a heap anyway (ISTM that > that will almost certainly be based on external sorts in the end). A > heap is not really compatible with the parallel heap scan model. I don't think I agree with this part, though I think it's unimportant as far as the current patch is concerned. My initial thought is that parallel sort should work like this: 1. Each worker reads and sorts its input tuples just as it would in non-parallel mode. 2. If, at the conclusion of the sort, the input tuples are still in memory (quicksort) or partially in memory (quicksort with spillover), then write them all to a tape. If they are on multiple tapes, merge those to a single tape. If they are on a single tape, do nothing else at this step. 3. At this point, we have one sorted tape per worker. Perform a final merge pass to get the final result. The major disadvantage of this is that if the input hasn't been relatively evenly partitioned across the workers, the work of sorting will fall disproportionately on those that got more input. We could, in the future, make the logic more sophisticated. For example, if worker A is still reading the input and dumping sorted runs, worker B could start merging those runs. Or worker A could read tuples into a DSM instead of backend-private memory, and worker B could then sort them to produce a run. While such optimizations are clearly beneficial, I would not try to put them into a first parallel sort patch. It's too complicated. >> One thing I just thought of (after the call) is that it might be >> better for this GUC to be in units of tuples rather than in units of >> memory; it's not clear to me why the optimal heap size should be >> dependent on the tuple size, so we could have a threshold like 300,000 >> tuples or whatever. > > I think you're right that a number of tuples is the logical way to > express the heap size (as a GUC unit). 
I think that the ideal setting > for the GUC is large enough to recognize significant correlations in > input data, which may be clustered, but no larger (at least while > things don't all fit in L1 cache, or maybe L2 cache). We should "go > for broke" with replacement selection -- we don't aim for anything > less than ending up with 1 run by using the heap (merging 2 or 3 runs > rather than 4 or 6 is far less useful, maybe harmful, when one of them > is much larger). Therefore, I don't expect that we'll be practically > disadvantaged by having fewer "hands to juggle" tuples here (we'll > simply almost always have enough in practice -- more on that later). > FWIW I don't think that any benchmark we've seen so far justifies > doing less than "going for broke" with RS, even if you happen to have > a very conservative perspective. > > One advantage of a GUC is that you can set it to zero, and always get > a simple hybrid sort-merge strategy if that's desirable. I think that > it might not matter much with multi-gigabyte work_mem settings anyway, > though; you'll just see a small blip. Big (maintenance_)work_mem was > by far my
Re: [HACKERS] Using quicksort for every external sort run
On Fri, Feb 5, 2016 at 9:31 AM, Robert Haas wrote: > Peter, please weigh in and let me know if I've gotten anything > incorrect here or if you think of other concerns afterwards. Right. Let me give you the executive summary first: I continue to believe, following thinking about the matter in detail, that this is a sensible compromise, that weighs everyone's concerns. It is pretty close to a win-win. I just need you to confirm what I say here in turn, so we're sure that we understand each other perfectly. > The basic idea is that we will add a new GUC with a name like > replacement_sort_mem that will have a default value in the range of > 20-30MB; or possibly we will hardcode this value, but for purposes of > this email I'm going to assume it's a GUC. If the value of work_mem > or maintenance_work_mem, whichever applies, is smaller than the value > of replacement_sort_mem, then the latter has no effect. By "no effect", you must mean that we always use a heap for the entire first run (albeit for the tail, with a hybrid quicksort/heap approach), but still use quicksort for every subsequent run, when it's clearly established that we aren't going to get one huge run. Is that correct? It was my understanding, based on your emphasis on producing only a single run, as well as your recent remarks on this thread about the first run being special, that you are really only interested in the presorted case, where one run is produced. That is, you are basically not interested in preserving the general ability of replacement selection to double run size in the event of a uniform distribution. (That particular doubling property of replacement selection is now technically lost by virtue of using this new hybrid model *anyway*, although it will still make runs longer in general). You don't want to change the behavior of the current patch for the second or subsequent run; that should remain a quicksort, pure and simple. Do I have that right? 
BTW, parallel sort should probably never use a heap anyway (ISTM that that will almost certainly be based on external sorts in the end). A heap is not really compatible with the parallel heap scan model. > One thing I just thought of (after the call) is that it might be > better for this GUC to be in units of tuples rather than in units of > memory; it's not clear to me why the optimal heap size should be > dependent on the tuple size, so we could have a threshold like 300,000 > tuples or whatever. I think you're right that a number of tuples is the logical way to express the heap size (as a GUC unit). I think that the ideal setting for the GUC is large enough to recognize significant correlations in input data, which may be clustered, but no larger (at least while things don't all fit in L1 cache, or maybe L2 cache). We should "go for broke" with replacement selection -- we don't aim for anything less than ending up with 1 run by using the heap (merging 2 or 3 runs rather than 4 or 6 is far less useful, maybe harmful, when one of them is much larger). Therefore, I don't expect that we'll be practically disadvantaged by having fewer "hands to juggle" tuples here (we'll simply almost always have enough in practice -- more on that later). FWIW I don't think that any benchmark we've seen so far justifies doing less than "going for broke" with RS, even if you happen to have a very conservative perspective. One advantage of a GUC is that you can set it to zero, and always get a simple hybrid sort-merge strategy if that's desirable. I think that it might not matter much with multi-gigabyte work_mem settings anyway, though; you'll just see a small blip. Big (maintenance_)work_mem was by far my greatest concern in relation to using a heap in general, so I'm left pretty happy by this plan, I think. Lots of people can afford a multi-GB maintenance_work_mem these days, and CREATE INDEX is gonna be the most important case overall, by far. > 2. 
If (maintenance_)work_mem fills up completely, we will quicksort > all of the data we have in memory. We will then regard the tail end > of that sorted data, in an amount governed by replacement_sort_mem, as > a heap, and use it to perform replacement selection until no tuples > remain for the current run. Meanwhile, the rest of the sorted data > remains in memory untouched. Logically, we're constructing a run of > tuples which is split between memory and disk: the head of the run > (what fits in all of (maintenance_)work_mem except for > replacement_sort_mem) is in memory, and the tail of the run is on > disk. I went back and forth on this during our call, but I now think that I was right that there will need to be changes in order to make the tail of the run a heap (*not* the quicksorted head), because routines like tuplesort_heap_siftup() assume that state->memtuples[0] is the head of the heap. This is currently assumed by the master branch for both the currentRun/nextRun replacement selection heap, as well as the heap used for merging. Chan
Re: [HACKERS] Using quicksort for every external sort run
On Thu, Feb 4, 2016 at 6:14 AM, Peter Geoghegan wrote: > The economics of using 4MB or even 20MB to sort 10GB of data are > already preposterously bad for everyone that runs a database server, > no matter how budget conscious they may be. I can reluctantly accept > that we need to still use a heap with very low work_mem settings to > avoid the risk of a regression (in the event of a strong correlation) > on general principle, but I'm well justified in proposing "just don't > do that" as the best practical advice. > > I thought I had your agreement on that point, Robert; is that actually the > case? Peter and I spent a few hours talking on Skype this morning about this point and I believe we have agreed on an algorithm that I think will address all of my concerns and hopefully also be acceptable to him. Peter, please weigh in and let me know if I've gotten anything incorrect here or if you think of other concerns afterwards. The basic idea is that we will add a new GUC with a name like replacement_sort_mem that will have a default value in the range of 20-30MB; or possibly we will hardcode this value, but for purposes of this email I'm going to assume it's a GUC. If the value of work_mem or maintenance_work_mem, whichever applies, is smaller than the value of replacement_sort_mem, then the latter has no effect. However, if replacement_sort_mem is the smaller value, then the amount of memory that can be used for a heap with replacement selection is limited to replacement_sort_mem: we can use more memory than that in total for the sort, but the amount that can be used for a heap is restricted to that value. The way we do this is explained in more detail below. One thing I just thought of (after the call) is that it might be better for this GUC to be in units of tuples rather than in units of memory; it's not clear to me why the optimal heap size should be dependent on the tuple size, so we could have a threshold like 300,000 tuples or whatever. 
But that's a secondary issue and I might be wrong about it: the point is that in order to have a chance of winning, a heap used for replacement selection needs to be not very big at all by the standards of modern hardware, so the plan is to limit it to a size at which it may have a chance. Here's how that will work, assuming Peter and I understand each other: 1. We start reading the input data. If we reach the end of the input data before (maintenance_)work_mem is exhausted, then we can simply quicksort the data and we're done. This is no different than what we already do today. 2. If (maintenance_)work_mem fills up completely, we will quicksort all of the data we have in memory. We will then regard the tail end of that sorted data, in an amount governed by replacement_sort_mem, as a heap, and use it to perform replacement selection until no tuples remain for the current run. Meanwhile, the rest of the sorted data remains in memory untouched. Logically, we're constructing a run of tuples which is split between memory and disk: the head of the run (what fits in all of (maintenance_)work_mem except for replacement_sort_mem) is in memory, and the tail of the run is on disk. 3. If we reach the end of input before replacement selection runs out of tuples for the current run, and if it finds no tuples for the next run prior to that time, then we are done. All of the tuples form a single run and we can return the tuples in memory first followed by the tuples on disk. This case is highly likely to be a huge win over what we have today, because (a) some portion of the tuples were sorted via quicksort rather than heapsort and that's faster, (b) the tuples that were sorted using a heap were sorted using a small heap rather than a big one, and (c) we only wrote out the minimal number of tuples to tape instead of, as we would have done today, all of them. 4. If we reach this step, then replacement selection with a small heap wasn't able to sort the input in a single run. 
We have a bunch of sorted data in memory which is the head of the same run whose tail is already on disk; we now spill all of these tuples to disk. That leaves only the heapified tuples in memory. We just ignore the fact that they are a heap and treat them as unsorted. We repeatedly do the following: read tuples until work_mem is full, sort them, and dump the result to disk as a run. When all runs have been created, we merge runs just as we do today.

This algorithm seems very likely to beat what we do today in practically all cases. The benchmarking Peter and others have already done shows that building runs with quicksort rather than replacement selection can often win even if the larger number of tapes requires a multi-pass merge. The only cases where it didn't seem to be a clear win involved data that was already in sorted order, or very close to it. But with this algorithm, presorted input is fine: we'll quicksort some of it (which is faster than replacement selection because quicksort has superior cache locality), and the rest will pass through the small heap, still yielding a single run.
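A minimal executable sketch of the four steps above (a toy model I wrote to illustrate the scheme, with slot counts standing in for bytes; this is not code from any patch, and the "next run" pool is a plain list rather than dead heap space):

```python
import heapq
from itertools import chain

def build_runs(input_iter, mem_slots, rs_slots):
    """Toy model of the hybrid run-building scheme: quicksort first,
    then bounded replacement selection for the first run's tail, then
    plain quicksorted runs. Returns a list of sorted runs."""
    it = iter(input_iter)
    buf = []
    for x in it:
        buf.append(x)
        if len(buf) == mem_slots:
            break
    if len(buf) < mem_slots:
        return [sorted(buf)]           # step 1: input fit in memory

    buf.sort()                         # step 2: quicksort everything
    head, heap = buf[:-rs_slots], buf[-rs_slots:]  # sorted tail is a valid min-heap
    tail, next_run, exhausted = [], [], False
    while heap:
        out = heapq.heappop(heap)      # replacement selection
        tail.append(out)
        try:
            x = next(it)
        except StopIteration:
            exhausted = True
            while heap:                # drain the heap into run 1
                tail.append(heapq.heappop(heap))
            break
        if x >= out:
            heapq.heappush(heap, x)    # still fits the current run
        else:
            next_run.append(x)         # must wait for the next run

    runs = [head + tail]               # run 1: memory head + disk tail
    if exhausted and not next_run:
        return runs                    # step 3: one run, no merge needed
    chunk = []                         # step 4: plain quicksorted runs
    for x in chain(next_run, it):
        chunk.append(x)
        if len(chunk) == mem_slots:
            runs.append(sorted(chunk))
            chunk = []
    if chunk:
        runs.append(sorted(chunk))
    return runs
```

On presorted input this returns a single run regardless of input size (step 3); on random input it degrades gracefully to quicksorted runs (step 4).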
Re: [HACKERS] Using quicksort for every external sort run
On Thu, Feb 4, 2016 at 1:46 AM, Peter Geoghegan wrote: > It wasn't my original insight that replacement selection has become > all but obsolete. It took me a while to come around to that point of > view. Nyberg et al may have said it best in 1994, in the Alphasort Paper [1]: "By comparison, OpenVMS sort uses a pure replacement-selection sort to generate runs (Knuth, 1973). Replacement-selection is best for a memory-constrained environment. On average, replacement-selection generates runs that are twice as large as available memory, while the QuickSort runs are typically less than half of available memory. However, in a memory-rich environment, QuickSort is faster because it is simpler, makes fewer exchanges on average, and has superior address locality to exploit processor caching. " (I believe that the authors state that "QuickSort runs are typically less than half of available memory" because of the use of explicit asynchronous I/O in each thread, which doesn't apply to us). The paper also has very good analysis of the economics of sorting: "Even for surprisingly large sorts, it is economical to perform the sort in one pass." Of course, memory capacities have scaled enormously in the 20 years since this analysis was performed, so the analysis applies even at the very low end these days. The high capacity memory system that they advocate to get a one pass sort (instead of having faster disks) had 100MB of memory, which is of course tiny by contemporary standards. If you pay Heroku $7 a month, you get a "Hobby Tier" database with 512MB of memory. The smallest EC2 instance size, the t2.nano, costs about $1.10 to run for one week, and has 0.5GB of memory. The economics of using 4MB or even 20MB to sort 10GB of data are already preposterously bad for everyone that runs a database server, no matter how budget conscious they may be. 
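The "runs twice the size of available memory" behavior cited above is easy to reproduce in a toy simulation (my own sketch of textbook replacement selection, counting slots rather than bytes; not PostgreSQL code):

```python
import heapq
import random

def replacement_selection_run_lengths(data, heap_size):
    """Classic replacement selection, recording the length of each
    run it emits. Illustrative sketch only."""
    it = iter(data)
    heap = [x for _, x in zip(range(heap_size), it)]
    heapq.heapify(heap)
    lengths, cur, pending = [], 0, []
    while heap:
        out = heapq.heappop(heap)
        cur += 1
        x = next(it, None)
        if x is not None:
            if x >= out:
                heapq.heappush(heap, x)   # joins the current run
            else:
                pending.append(x)         # held back for the next run
        if not heap:                      # current run is finished
            lengths.append(cur)
            cur, heap, pending = 0, pending, []
            heapq.heapify(heap)
    return lengths

random.seed(42)
lengths = replacement_selection_run_lengths(
    [random.random() for _ in range(200_000)], 1_000)
avg = sum(lengths) / len(lengths)
print(round(avg / 1_000, 1))   # close to 2: runs average ~2x "memory"
```

The same function also reproduces the other property discussed in this thread: on presorted input it emits one run no matter how large the input is.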
I can reluctantly accept that we need to still use a heap with very low work_mem settings to avoid the risk of a regression (in the event of a strong correlation) on general principle, but I'm well justified in proposing "just don't do that" as the best practical advice. I thought I had your agreement on that point, Robert; is that actually the case? [1] http://www.cs.berkeley.edu/~rxin/db-papers/alphasort.pdf -- Peter Geoghegan -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Using quicksort for every external sort run
On Sat, Jan 30, 2016 at 5:29 AM, Robert Haas wrote: >> I meant use "quicksort with spillover" simply because an estimated >> 90%+ of all tuples have already been consumed. Don't consider the >> tuple width, etc. > > Hmm, it's a thought. To be honest, it's a bit annoying that this is one issue we're stuck on, because "quicksort with spillover" is clearly of less importance overall. (This is a distinct issue from the issue of not using a replacement selection style heap for the first run much of the time, which seems to be a discussion about whether and to what extent the *traditional* advantages of replacement selection hold today, as opposed to a discussion about a very specific crossover point in my patch.) >> Really? Just try it with a heap that is not tiny. Performance tanks. >> The fact that replacement selection can produce one long run then >> becomes a liability, not a strength. With a work_mem of something like >> 1GB, it's *extremely* painful. > > I'm not sure exactly what you think I should try. I think a couple of > people have expressed the concern that your patch might regress things > on data that is all in order, but I'm not sure if you think I should > try that case or some case that is not-quite-in-order. "I don't see > that you've justified that statement" is referring to the fact that > you presented no evidence in your original post that it's important to > sometimes use quicksorting even for run #1. If you've provided some > test data illustrating that point somewhere, I'd appreciate a pointer > back to it. I think that the answer to what you should try is simple: Any case involving a large heap (say, a work_mem of 1GB). No other factor like correlation seems to change the conclusion about that being generally bad. 
If you have a correlation, then that is *worse* if "quicksort with spillover" always has us use a heap for the first run, because it prolongs the pain of using the cache inefficient heap (note that this is an observation about "quicksort with spillover" in particular, and not replacement selection in general). The problem you'll see is that there is a large heap which is __slow__ to spill from, and that's pretty obvious with or without a correlation. In general it seems unlikely that having one long run during the merge (i.e. no merge -- seen by having the heap build one long run because we got "lucky" and "quicksort with spillover" encountered a correlation) can ever hope to make up for this. It *could* still make up for it if: 1. There isn't much to make up for in the first place, because the heap is CPU cache resident. Testing this with a work_mem that is the same size as CPU L3 cache seems a bit pointless to me, and I think we've seen that a few times. and: 2. There are many passes required without a replacement selection heap, because the volume of data is just so much greater than the low work_mem setting. Replacement selection makes the critical difference because there is a correlation, perhaps strong enough to make it one or two runs rather than, say, 10 or 20 or 100. I've already mentioned many times that linear growth in the size of work_mem sharply reduces the need for additional passes during the merge phase (the observation about quadratic growth that I won't repeat). These days, it's hard to recommend anything other than "use more memory" to someone trying to use 4MB to sort 10GB of data. Yeah, it would also be faster to use replacement selection for the first run in the hope of getting lucky (actually lucky this time; no quotes), but it's hard to imagine that that's going to be a better option, no matter how frugal the user is. Helping users recognize when they could use more memory effectively seems like the best strategy. 
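The point that linear growth in work_mem sharply reduces merge passes can be sketched numerically (a rough balanced-merge model of my own for illustration; polyphase merge does somewhat better, and the scaling of merge order with memory is a toy assumption):

```python
import math

def merge_passes(num_runs, merge_order):
    """Approximate passes for a balanced merge that combines
    merge_order runs per pass."""
    if num_runs <= 1:
        return 0
    return math.ceil(math.log(num_runs, merge_order))

# Toy scenario: ~10GB of data; assume run count shrinks and merge
# order grows linearly with work_mem (both are rough assumptions).
for mem_mb in (4, 8, 16, 32):
    runs = 10 * 1024 // mem_mb
    print(mem_mb, runs, merge_passes(runs, mem_mb))
```

Even this crude model shows passes falling quickly as memory grows, which is why "use more memory" beats clever run-building for badly undersized work_mem.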
That was the idea behind multipass_warning, but you didn't like that (Greg Stark was won over on multipass_warning, though). I hope we can offer something roughly like that at some point (a view?), because it makes sense.

> How about always starting with replacement selection, but limiting the > amount of memory that can be used with replacement selection to some > small value? It could be a separate GUC, or a hard-coded constant > like 20MB if we're fairly confident that the same value will be good > for everyone. If the tuples aren't in order, then we'll pretty > quickly come to the end of the first run and switch to quicksort.

This seems acceptable, although note that we don't have to decide until we reach the work_mem limit, and not before. If you want to use a heap for the first run, I'm not excited about the idea, but if you insist then I'm glad that you at least propose to limit it to the kind of cases that we *actually* saw regressed (i.e. low work_mem settings -- like the default work_mem setting, 4MB). We've seen no actual case with a larger work_mem that is advantaged by using a heap, even *with* a strong correlation (this is actually *worst of all*); that's where I am determined to avoid using a heap.
Re: [HACKERS] Using quicksort for every external sort run
On Sat, Jan 30, 2016 at 2:25 AM, Peter Geoghegan wrote: > On Fri, Jan 29, 2016 at 2:58 PM, Robert Haas wrote: >> I don't quite know what you mean by these numbers. Add a generic, >> conservative threshold to what? > > I meant use "quicksort with spillover" simply because an estimated > 90%+ of all tuples have already been consumed. Don't consider the > tuple width, etc. Hmm, it's a thought. >> Thinking about this some more, I really think we should think hard >> about going back to the strategy which you proposed and discarded in >> your original post: always generate the first run using replacement >> selection, and every subsequent run by quicksorting. In that post you >> mention powerful advantages of this method: "even without a strong >> logical/physical correlation, the algorithm tends to produce runs that >> are about twice the size of work_mem. (It's also notable that >> replacement selection only produces one run with mostly presorted >> input, even where input far exceeds work_mem, which is a neat trick.)" >> You went on to dismiss that strategy, saying that "despite these >> upsides, replacement selection is obsolete, and should usually be >> avoided." But I don't see that you've justified that statement. > > Really? Just try it with a heap that is not tiny. Performance tanks. > The fact that replacement selection can produce one long run then > becomes a liability, not a strength. With a work_mem of something like > 1GB, it's *extremely* painful. I'm not sure exactly what you think I should try. I think a couple of people have expressed the concern that your patch might regress things on data that is all in order, but I'm not sure if you think I should try that case or some case that is not-quite-in-order. "I don't see that you've justified that statement" is referring to the fact that you presented no evidence in your original post that it's important to sometimes use quicksorting even for run #1. 
If you've provided some test data illustrating that point somewhere, I'd appreciate a pointer back to it.

> A compromise that may be acceptable is to always do a "quicksort with > spillover" when there is a very low work_mem setting and the estimate > of the number of input tuples is less than 10x of what we've seen so > far. Maybe less than 20MB. That will achieve the same thing.

How about always starting with replacement selection, but limiting the amount of memory that can be used with replacement selection to some small value? It could be a separate GUC, or a hard-coded constant like 20MB if we're fairly confident that the same value will be good for everyone. If the tuples aren't in order, then we'll pretty quickly come to the end of the first run and switch to quicksort. If we do end up using replacement selection for the whole sort, the smaller heap is an advantage. What I like about this sort of thing is that it adds no reliance on any estimate; it's fully self-tuning.

-- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Re: [HACKERS] Using quicksort for every external sort run
On 30 Jan 2016 8:27 am, "Greg Stark" wrote: > > > On 29 Jan 2016 11:58 pm, "Robert Haas" wrote: > > It > > seems pretty easy to construct cases where this technique regresses, > > and a large percentage of those cases are precisely those where > > replacement selection would have produced a single run, avoiding the > > merge step altogether. > > Now that avoiding the merge phase altogether didn't necessarily represent any actual advantage. > > We don't find out we've avoided the merge phase until the entire run has been spiked to disk. Hm, sorry about the phone typos. I thought I proofread it as I went but obviously not that effectively...
Re: [HACKERS] Using quicksort for every external sort run
On 29 Jan 2016 11:58 pm, "Robert Haas" wrote: > It > seems pretty easy to construct cases where this technique regresses, > and a large percentage of those cases are precisely those where > replacement selection would have produced a single run, avoiding the > merge step altogether.

Note that avoiding the merge phase altogether didn't necessarily represent any actual advantage. We don't find out we've avoided the merge phase until the entire run has been spilled to disk. Then we need to read it back in from disk to serve up those tuples. If we have tapes to merge but can do them in a single pass, we do that lazily and merge as needed when we serve up the tuples. I doubt there's any speed difference in reading two sequential streams with our buffering over one, especially in the midst of a query doing other i/o. And N extra comparisons is less than the quicksort advantage.

If we could somehow predict that it'll be a single output run that would be a huge advantage. But having to spill all the tuples and then find out isn't really helpful.
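Greg's point that a single-pass merge serves tuples lazily, rather than materializing anything up front, looks like this in miniature (a sketch using Python's lazy merge as a stand-in; not PostgreSQL code):

```python
import heapq

# Two sorted runs, standing in for tapes; iterators model sequential reads.
run_a = iter([1, 3, 5, 7])
run_b = iter([2, 4, 6, 8])

merged = heapq.merge(run_a, run_b)   # lazy: nothing is read until asked for
first_three = [next(merged) for _ in range(3)]
print(first_three)   # [1, 2, 3] -- only a few tuples pulled from each run
```

Each tuple served costs one small comparison step against the other stream's head, which is the "N extra comparisons" being weighed against the quicksort advantage.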
Re: [HACKERS] Using quicksort for every external sort run
On Fri, Jan 29, 2016 at 2:58 PM, Robert Haas wrote: > I don't quite know what you mean by these numbers. Add a generic, > conservative threshold to what? I meant use "quicksort with spillover" simply because an estimated 90%+ of all tuples have already been consumed. Don't consider the tuple width, etc. > Thinking about this some more, I really think we should think hard > about going back to the strategy which you proposed and discarded in > your original post: always generate the first run using replacement > selection, and every subsequent run by quicksorting. In that post you > mention powerful advantages of this method: "even without a strong > logical/physical correlation, the algorithm tends to produce runs that > are about twice the size of work_mem. (It's also notable that > replacement selection only produces one run with mostly presorted > input, even where input far exceeds work_mem, which is a neat trick.)" > You went on to dismiss that strategy, saying that "despite these > upsides, replacement selection is obsolete, and should usually be > avoided." But I don't see that you've justified that statement. Really? Just try it with a heap that is not tiny. Performance tanks. The fact that replacement selection can produce one long run then becomes a liability, not a strength. With a work_mem of something like 1GB, it's *extremely* painful. > It seems pretty easy to construct cases where this technique regresses, > and a large percentage of those cases are precisely those where > replacement selection would have produced a single run, avoiding the > merge step altogether. ...*and* where many passes are otherwise required (otherwise, the merge is still cheap enough to leave us ahead). Typically with very small work_mem settings, like 4MB, and far larger data volumes. It's easy to construct those cases, but that doesn't mean that they particularly matter. Using 4MB of work_mem to sort 10GB of data is penny wise and pound foolish. 
The cases we've seen regressed are mostly a concern because misconfiguration happens. A compromise that may be acceptable is to always do a "quicksort with spillover" when there is a very low work_mem setting and the estimate of the number of input tuples is less than 10x of what we've seen so far. Maybe less than 20MB. That will achieve the same thing. > I'm quite willing to run somewhat more slowly than in other cases to > be certain of not regressing the case of completely or > almost-completely ordered input. Even if that didn't seem like a > sufficient reason unto itself, I'd be willing to go that way just so > we don't have to depend on a cost model that might easily go wrong due > to bad input even if it were theoretically perfect in every other > respect (which I'm pretty sure is not true here anyway). The consequences of being wrong either way are not severe (note that making one long run isn't a goal of the cost model currently). > I also have another idea that might help squeeze more performance out > of your approach and avoid regressions. Suppose that we add a new GUC > with a name like sort_mem_stretch_multiplier or something like that, > with a default value of 2.0 or 4.0 or whatever we think is reasonable. > When we've written enough runs that a polyphase merge will be > required, or when we're actually performing a polyphase merge, the > amount of memory we're allowed to use increases by this multiple. The > idea is: we hope that people will set work_mem appropriately and > consequently won't experience polyphase merges at all, but it might. > However, it's almost certain not to happen very frequently. > Therefore, using extra memory in such cases should be acceptable, > because while you might have every backend in the system using 1 or > more copies of work_mem for something if the system is very busy, it > is extremely unlikely that you will have more than a handful of > processes doing polyphase merges. I'm not sure that that's practical. 
Currently, tuplesort decides on a number of tapes ahead of time. When we're constrained on those, the stretch multiplier would apply, but I think that that could be invasive because the number of tapes ("merge order" + 1) was a function of non-stretched work_mem.

-- Peter Geoghegan
Re: [HACKERS] Using quicksort for every external sort run
On Fri, Jan 29, 2016 at 12:46 PM, Peter Geoghegan wrote: > On Fri, Jan 29, 2016 at 9:24 AM, Robert Haas wrote: >> I feel like this could be data driven. I mean, the cost model is >> based mainly on the tuple width and the size of the SortTuple array. >> So, it should be possible to tests of both algorithms on 32, 64, 96, >> 128, ... byte tuples with a SortTuple array that is 256MB, 512MB, >> 768MB, 1GB, ... Then we can judge how closely the cost model comes to >> mimicking the actual behavior. > > You would also need to represent how much of the input actually ended > up being sorted with the heap in each case. Maybe that could be tested > at 50% (bad for "quicksort with spillover"), 25% (better), and 5% > (good). > > An alternative approach that might be acceptable is to add a generic, > conservative 90% threshold (so 10% of tuples sorted by heap). I don't quite know what you mean by these numbers. Add a generic, conservative threshold to what? Thinking about this some more, I really think we should think hard about going back to the strategy which you proposed and discarded in your original post: always generate the first run using replacement selection, and every subsequent run by quicksorting. In that post you mention powerful advantages of this method: "even without a strong logical/physical correlation, the algorithm tends to produce runs that are about twice the size of work_mem. (It's also notable that replacement selection only produces one run with mostly presorted input, even where input far exceeds work_mem, which is a neat trick.)" You went on to dismiss that strategy, saying that "despite these upsides, replacement selection is obsolete, and should usually be avoided." But I don't see that you've justified that statement. It seems pretty easy to construct cases where this technique regresses, and a large percentage of those cases are precisely those where replacement selection would have produced a single run, avoiding the merge step altogether. 
I think those cases are extremely important. I'm quite willing to run somewhat more slowly than in other cases to be certain of not regressing the case of completely or almost-completely ordered input. Even if that didn't seem like a sufficient reason unto itself, I'd be willing to go that way just so we don't have to depend on a cost model that might easily go wrong due to bad input even if it were theoretically perfect in every other respect (which I'm pretty sure is not true here anyway).

I also have another idea that might help squeeze more performance out of your approach and avoid regressions. Suppose that we add a new GUC with a name like sort_mem_stretch_multiplier or something like that, with a default value of 2.0 or 4.0 or whatever we think is reasonable. When we've written enough runs that a polyphase merge will be required, or when we're actually performing a polyphase merge, the amount of memory we're allowed to use increases by this multiple. The idea is: we hope that people will set work_mem appropriately and consequently won't experience polyphase merges at all, but one might happen anyway. However, it's almost certain not to happen very frequently. Therefore, using extra memory in such cases should be acceptable, because while you might have every backend in the system using 1 or more copies of work_mem for something if the system is very busy, it is extremely unlikely that you will have more than a handful of processes doing polyphase merges.

-- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Re: [HACKERS] Using quicksort for every external sort run
On Fri, Jan 29, 2016 at 9:24 AM, Robert Haas wrote: > I feel like this could be data driven. I mean, the cost model is > based mainly on the tuple width and the size of the SortTuple array. > So, it should be possible to tests of both algorithms on 32, 64, 96, > 128, ... byte tuples with a SortTuple array that is 256MB, 512MB, > 768MB, 1GB, ... Then we can judge how closely the cost model comes to > mimicking the actual behavior.

You would also need to represent how much of the input actually ended up being sorted with the heap in each case. Maybe that could be tested at 50% (bad for "quicksort with spillover"), 25% (better), and 5% (good).

An alternative approach that might be acceptable is to add a generic, conservative 90% threshold (so 10% of tuples sorted by heap).

-- Peter Geoghegan
Re: [HACKERS] Using quicksort for every external sort run
On Wed, Jan 27, 2016 at 8:20 AM, Peter Geoghegan wrote: > Correct me if I'm wrong, but I think that the only outstanding issue > with all patches posted here so far is the "quicksort with spillover" > cost model. Hopefully this can be cleared up soon. As I've said, I am > very receptive to other people's suggestions about how that should > work.

I feel like this could be data driven. I mean, the cost model is based mainly on the tuple width and the size of the SortTuple array. So, it should be possible to run tests of both algorithms on 32, 64, 96, 128, ... byte tuples with a SortTuple array that is 256MB, 512MB, 768MB, 1GB, ... Then we can judge how closely the cost model comes to mimicking the actual behavior.

-- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Re: [HACKERS] Using quicksort for every external sort run
On Fri, Jan 29, 2016 at 3:41 AM, Mithun Cy wrote: > I just ran some tests on above patch. Mainly to compare > how "longer sort keys" would behave with new(Qsort) and old Algo(RS) for > sorting. > I have 8GB of ram and ssd storage. > > Settings and Results. > > Work_mem= DEFAULT (4mb). > key width = 520. > If data is sorted as same as sort key order then current code performs better > than proposed patch > as sort size increases. > > It appears new algo do not seem have any major impact if rows are presorted > in opposite order. > > For randomly distributed order quick sort performs well when compared to > current sort method (RS). > > == > Now Increase the work_mem to 64MB and for 14 GB of data to sort. > > CASE 1: We can see Qsort is able to catchup with current sort method(RS). > CASE 2: No impact. > CASE 3: RS is able to catchup with Qsort.

I think that the basic method you're using to do these tests may have additional overhead:

-- sort in ascending order.
CREATE FUNCTION test_orderby_asc() RETURNS int AS $$
#print_strict_params on
DECLARE
  gs int;
  jk text;
BEGIN
  SELECT string_4k, generate_series INTO jk, gs
  FROM so
  ORDER BY string_4k, generate_series;
  RETURN gs;
END
$$ LANGUAGE plpgsql;

Anyway, these test cases all remove much of the advantage of increased cache efficiency. No comparisons are *ever* resolved using the leading attribute, which calls into question why anyone would sort on that. It's 512 bytes, so it artificially makes the comparisons themselves the bottleneck, as opposed to cache efficiency. You can't even fit the second attribute in the same cacheline as the first in the "tuple proper" (MinimalTuple).

You are using a 4MB work_mem setting, but you almost certainly have a CPU with an L3 cache size that's a multiple of that, even with cheap consumer grade hardware. You have 8GB of ram; a 4MB work_mem setting is a very small setting (I mean in an absolute sense, but especially relative to the size of the data).
You mentioned "CASE 3: RS is able to catchup with Qsort", which doesn't make much sense to me. The only way I think that is possible is by making the increased work_mem sufficient to have much longer runs, because there is in fact somewhat of a correlation in the data, and an increased work_mem makes the critical difference, allowing perhaps one long run to be used -- there is now enough memory to "juggle" tuples without ever needing to start a new run. But, how could that be? You said case 3 was totally random data, so I'd only expect incremental improvement. It could also be some weird effect from polyphase merge. A discontinuity. I also don't understand why the patch ("Qsort") can be so much slower between case 1 and case 3 on 3.5GB+ sizes, but not the 1.7GB size. Even leaving aside the differences between "RS" and "Qsort", it makes no sense to me that *both* are faster with random data ("CASE 3") than with presorted data ("CASE 1"). Another weird thing is that the traditional best case for replacement selection ("RS") is a strong correlation, and a traditional worst case is an inverse correlation, where run size is bound strictly by memory. But you show just the opposite here -- the inverse correlation is faster with RS in the 1.7 GB data case. So, I have no idea what's going on here, and find it all very confusing. In order for these numbers to be useful, they need more detail -- "trace_sort" output. There are enough confounding factors in general, and especially here, that not having that information makes raw numbers very difficult to interpret. > I think for long keys both old (RS) and new (Qsort) sort method has its own > characteristics > based on data distribution. I think work_mem is the key If properly set new > method(Qsort) will > be able to fit most of the cases. If work_mem is not tuned right it, there > are cases it can regress. work_mem is impossible to tune right with replacement selection. That's a key advantage of the proposed new approach. 
-- Peter Geoghegan
Re: [HACKERS] Using quicksort for every external sort run
On Fri, Jan 29, 2016 at 5:11 PM, Mithun Cy wrote:
> I just ran some tests on the above patch, mainly to compare how "longer
> sort keys" would behave with the new (Qsort) and old (RS) algorithms for
> sorting. I have 8GB of ram and ssd storage.
>
> Key length 520
>
> Number of records    320         640          1280         2560
> Data size            1.7 GB      3.5 GB       7 GB         14 GB
>
> CASE 1   RS       23654.677   35172.811    44965.442   106420.155
>          Qsort    14100.362   40612.829   101068.107   334893.391
> CASE 2   RS       13427.378   36882.898    98492.644   310670.15
>          Qsort    12475.133   32559.074   100772.531   322080.602
> CASE 3   RS       17202.966   45163.234   122323.299   337058.856
>          Qsort    12530.726   23343.753    59431.315   152862.837
>
> With work_mem = 64MB, sorting 14 GB of data:
>
> CASE 1   RS      128822.735
>          Qsort    90857.496
> CASE 2   RS      105631.775
>          Qsort   105938.334
> CASE 3   RS      152301.054
>          Qsort   149649.347

Sorry, forgot to mention: the data in the tables above is in units of ms, as returned by the psql client.

-- Thanks and Regards Mithun C Y EnterpriseDB: http://www.enterprisedb.com
Re: [HACKERS] Using quicksort for every external sort run
On Tue, Dec 29, 2015 at 4:33 AM, Peter Geoghegan wrote: > Attached is a revision that significantly overhauls the memory patch, > with several smaller changes.

I just ran some tests on the above patch, mainly to compare how "longer sort keys" would behave with the new (Qsort) and old (RS) algorithms for sorting. I have 8GB of ram and ssd storage.

Settings and Results.

Work_mem = DEFAULT (4MB). Key width = 520.

CASE 1. Data is pre-sorted in sort key order.
CASE 2. Data is sorted in the opposite order of the sort key.
CASE 3. Data is randomly distributed.

Key length 520

Number of records    320         640          1280         2560
Data size            1.7 GB      3.5 GB       7 GB         14 GB

CASE 1   RS       23654.677   35172.811    44965.442   106420.155
         Qsort    14100.362   40612.829   101068.107   334893.391
CASE 2   RS       13427.378   36882.898    98492.644   310670.15
         Qsort    12475.133   32559.074   100772.531   322080.602
CASE 3   RS       17202.966   45163.234   122323.299   337058.856
         Qsort    12530.726   23343.753    59431.315   152862.837

If data is pre-sorted in sort key order, then the current code performs better than the proposed patch as the sort size increases. The new algorithm does not appear to have any major impact if rows are presorted in the opposite order. For randomly distributed order, quicksort performs well compared to the current sort method (RS).

==
Now increase work_mem to 64MB, sorting 14 GB of data.

CASE 1: We can see Qsort is able to catch up with the current sort method (RS).
CASE 2: No impact.
CASE 3: RS is able to catch up with Qsort.

CASE 1   RS      128822.735
         Qsort    90857.496
CASE 2   RS      105631.775
         Qsort   105938.334
CASE 3   RS      152301.054
         Qsort   149649.347

I think for long keys both the old (RS) and new (Qsort) sort methods have their own characteristics, based on data distribution. I think work_mem is the key: if properly set, the new method (Qsort) will be able to fit most of the cases; if work_mem is not tuned right, there are cases where it can regress.
-- Thanks and Regards Mithun C Y EnterpriseDB: http://www.enterprisedb.com

[Attachment: Test Queries.sql (application/sql)]
Re: [HACKERS] Using quicksort for every external sort run
On Wed, Dec 23, 2015 at 9:37 AM, Robert Haas wrote: >> But yes, let me concede more clearly: the cost model is based on >> frobbing. But at least it's relatively honest about that, and is >> relatively simple. I think it might be possible to make it simpler, >> but I have a feeling that anything we can come up with will basically >> have the same quality that you so dislike. I don't know how to do >> better. Frankly, I'd rather be roughly correct than exactly wrong. > > Sure, but the fact that the model has huge discontinuities - perhaps > most notably a case where adding a single tuple to the estimated > cardinality changes the crossover point by a factor of two - suggests > that you are probably wrong. The actual behavior does not change > sharply when the size of the SortTuple array crosses 1GB, but the > estimates do. Here is some fairly interesting analysis of Quicksort vs. Heapsort, from Bentley, coauthor of our own Quicksort implementation: https://youtu.be/QvgYAQzg1z8?t=16m15s (This link picks up at the right point to see the comparison, complete with an interesting graph). It probably doesn't tell you much that you didn't already know, at least at this exact point, but it's nice to see Bentley's graph. This perhaps gives you some idea of why my "quicksort with spillover" cost model had a cap of MaxAllocSize of SortTuples, past which we always needed a very compelling case. That was my rough guess of where the Heapsort graph takes a sharp upward turn. Before then, Bentley shows that it's close enough to a straight line. Correct me if I'm wrong, but I think that the only outstanding issue with all patches posted here so far is the "quicksort with spillover" cost model. Hopefully this can be cleared up soon. As I've said, I am very receptive to other people's suggestions about how that should work. 
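Bentley's comparison can be loosely reproduced by counting comparisons (a toy of my own for illustration: Python's sorted() is Timsort, standing in for quicksort, and this captures comparison counts only, not the cache-locality effects Bentley emphasizes):

```python
import heapq
import random

class Counted:
    """Value wrapper that counts every comparison made on it."""
    n = 0
    def __init__(self, v):
        self.v = v
    def __lt__(self, other):
        Counted.n += 1
        return self.v < other.v

def comparisons(sort_fn, data):
    Counted.n = 0
    out = sort_fn([Counted(v) for v in data])
    assert [c.v for c in out] == sorted(data)   # sanity: really sorted
    return Counted.n

def heapsort(xs):
    heapq.heapify(xs)
    return [heapq.heappop(xs) for _ in range(len(xs))]

random.seed(0)
data = [random.random() for _ in range(10_000)]
q = comparisons(sorted, data)
h = comparisons(heapsort, data)
print(h > q)   # heapsort needs notably more comparisons on random input
```

This echoes the shape of Bentley's graph; the gap in practice is wider still once heapsort's poor address locality is accounted for.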
-- Peter Geoghegan
Re: [HACKERS] Using quicksort for every external sort run
On Fri, Dec 18, 2015 at 11:57 AM, Peter Geoghegan wrote: > BTW, I'm not necessarily determined to make the new special-purpose > allocator work exactly as proposed. It seemed useful to prioritize > simplicity, and currently so there is one big "huge palloc()" with > which we blow our memory budget, and that's it. However, I could > probably be more clever about "freeing ranges" initially preserved for > a now-exhausted tape. That kind of thing.

Attached is a revision that significantly overhauls the memory patch, with several smaller changes. We can now grow memtuples to rebalance the size of the array (memtupsize) against the need for memory for tuples. Doing this makes a big difference with a 500MB work_mem setting in this datum sort case, as my newly expanded trace_sort instrumentation shows:

LOG: grew memtuples 1.40x from 9362286 (219429 KB) to 13107200 (307200 KB) for final merge
LOG: tape 0 initially used 34110 KB of 34110 KB batch (1.000) and 13107200 slots remaining
LOG: tape 1 initially used 34110 KB of 34110 KB batch (1.000) and has 1534 slots remaining
LOG: tape 2 initially used 34110 KB of 34110 KB batch (1.000) and has 1535 slots remaining
LOG: tape 3 initially used 34110 KB of 34110 KB batch (1.000) and has 1533 slots remaining
LOG: tape 4 initially used 34110 KB of 34110 KB batch (1.000) and has 1534 slots remaining
LOG: tape 5 initially used 34110 KB of 34110 KB batch (1.000) and has 1535 slots remaining

This is a big improvement. With the new batchmemtuples() call commented out (i.e.
no new grow_memtuples() call), the LOG output around the same point is:

LOG: tape 0 initially used 24381 KB of 48738 KB batch (0.500) and has 1 slots remaining
LOG: tape 1 initially used 24381 KB of 48738 KB batch (0.500) and has 1 slots remaining
LOG: tape 2 initially used 24381 KB of 48738 KB batch (0.500) and has 1 slots remaining
LOG: tape 3 initially used 24381 KB of 48738 KB batch (0.500) and has 1 slots remaining
LOG: tape 4 initially used 24381 KB of 48738 KB batch (0.500) and has 1 slots remaining
LOG: tape 5 initially used 24381 KB of 48738 KB batch (0.500) and has 1 slots remaining

(I actually added a bit more detail to what you see here during final clean-up.)

Obviously we're using memory a lot more efficiently here as compared to my last revision (or the master branch -- it always has palloc() overhead, of course). With the new grow_memtuples, we're not wasting ~1530 slots per tape anymore (which is a tiny fraction of 1% of the total), but without it we are wasting 50% of all batch memory, or almost 30% of all work_mem. Note that this improvement is possible despite the fact that memory is still MAXALIGN()'d -- I'm mostly just clawing back what I can, having avoided much STANDARDCHUNKHEADERSIZE overhead for the final on-the-fly merge.

I tend to think that the bigger problem here is that we use so many memtuples when merging in the first place, though (e.g. 60% in the above case), because memtuples are much less useful than something like a simple array of pointers when merging; I can certainly see why you'd need 6 memtuples here, for the merge heap, but the other ~13 million seem mostly unnecessary. Anyway, what I have now is as far as I want to go to accelerate merging for 9.6, since parallel CREATE INDEX is where the next big win will come from. As wasteful as this can be, I think it's of secondary importance.
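For reference, the arithmetic behind the quoted trace_sort lines works out as follows. This sketch assumes the 24-byte SortTuple of a typical 64-bit build (an assumption, not something the log states), and takes the slot counts from the log above:

```python
SLOT_BYTES = 24  # assumed sizeof(SortTuple) on a 64-bit build

old_slots, new_slots = 9_362_286, 13_107_200   # from the "grew memtuples" line

old_kb = old_slots * SLOT_BYTES // 1024        # ~219429 KB, as reported
new_kb = new_slots * SLOT_BYTES // 1024        # 307200 KB, i.e. 300MB of slots
growth = new_slots / old_slots                 # the "1.40x" reported
```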
With this revision, I've given up on the idea of trying to map USEMEM()/FREEMEM() to "logical" allocations and deallocations that consume from each tape's batch. The existing merge code in the master branch is concerned exclusively with making each tape's use of memory fair; each tape only gets so many "slots" (memtuples), and so much memory, and that's it (there is never any shuffling of those resource budgets between tapes). I get the same outcome from simply only allowing tapes to get memory from their own batch allocation, which isn't much complexity, because only READTUP() routines regularly need memory. We detect when memory has been exhausted within mergeprereadone() in a special way, not using LACKMEM() at all -- this seems simpler. (Specifically, we use something called overflow allocations for this purpose. This means that there are still a very limited number of retail palloc() calls.) This new version somewhat formalizes the idea that batch allocation may one day have uses beyond the final on-the-fly merge phase, which makes a lot of sense. We should really be saving a significant amount of memory when initially sorting runs, too. This revision also pfree()s tape memory early if the tape is exhausted early, which will help a lot when there is a logical/physical correlation. Overall, I'm far happier with how memory is managed in this revision, mostly because it's easier to reason about. trace_sort now closely monitors where memory goes, and I think that's a good idea in general. That makes production performance problems a lot easier to reason about -- the accounting shou
Re: [HACKERS] Using quicksort for every external sort run
On Wed, Dec 23, 2015 at 7:48 PM, Peter Geoghegan wrote: > I wonder, what's the situation here like with the attached patch > applied on top of what you were testing? I think that we might be > better off with more merge steps when under enormous memory pressure > at the low end, in order to be able to store more tuples per tape (and > do more sorting using quicksort). Actually, now that I look into it, I think your 64MB work_mem setting would have 234 tapes in total, so my patch won't do anything for your case. Maybe change MAXORDER to 100 within the patch, to see where that leaves things? I want to see if there is any improvement. 234 tapes means that approximately 5.7MB of memory would go to just using tapes (for accounting purposes, which is mostly my concern here). However, for a case like this, where you're well short of being able to do everything in one pass, there is no benefit to having more than about 6 tapes (I guess that's probably still true these days). That 5.7MB of tape space for accounting purposes (and also in reality) may not only increase the amount of random I/O required, and not only throw off the memtuples estimate within grow_memtuples() (its balance against everything else), but also decrease the cache efficiency in the final on-the-fly merge (the efficiency in accessing tuples). -- Peter Geoghegan
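The 5.7MB figure can be reproduced with a quick sketch. BLCKSZ is assumed to be the default 8192 here, and TAPE_BUFFER_OVERHEAD is the "3 blocks worth of buffer space per tape" quoted from tuplesort.c elsewhere in this thread:

```python
BLCKSZ = 8192                        # default block size (assumed)
TAPE_BUFFER_OVERHEAD = BLCKSZ * 3    # ~24KB of buffer space charged per tape

max_tapes = 234                      # merge order 233 plus one output tape at 64MB work_mem
tape_space = max_tapes * TAPE_BUFFER_OVERHEAD
print(tape_space / 1_000_000)        # ~5.75 MB -- the "5.7MB" cited above
```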
Re: [HACKERS] Using quicksort for every external sort run
On Wed, Dec 23, 2015 at 1:03 PM, Jeff Janes wrote: > The regression I found when building an index on a column of > 400,000,000 md5(random()::text) with 64MB maintenance_work_mem was not > hard to find at all. I still don't understand what is going on with > it, but it is reproducible. Perhaps it is very unlikely and I just > got very lucky in finding it immediately after switching to that > data-type for my tests, but I wouldn't assume that on current > evidence.

Well, that is a lot of tuples to sort with such a small amount of memory. I have a new theory. Maybe part of the problem here is that in very low memory conditions, the tape overhead really is kind of wasteful, and we're back to having to worry about per-tape overhead (6 tapes may have been far too miserly as a universal number back before that was fixed [1], but that doesn't mean that the per-tape overhead is literally zero). You get a kind of thrashing, perhaps. Also, more tapes results in more random I/O, and that's an added cost, too; the cure may be worse than the disease. I also think that this might be a problem in your case:

 * In this calculation we assume that each tape will cost us about 3 blocks
 * worth of buffer space (which is an underestimate for very large data
 * volumes, but it's probably close enough --- see logtape.c).

I wonder, what's the situation here like with the attached patch applied on top of what you were testing? I think that we might be better off with more merge steps when under enormous memory pressure at the low end, in order to be able to store more tuples per tape (and do more sorting using quicksort). I also think that under conditions such as you describe, this code may play havoc with memory accounting:

/*
 * Decrease availMem to reflect the space needed for tape buffers; but
 * don't decrease it to the point that we have no room for tuples. (That
 * case is only likely to occur if sorting pass-by-value Datums; in all
 * other scenarios the memtuples[] array is unlikely to occupy more than
 * half of allowedMem. In the pass-by-value case it's not important to
 * account for tuple space, so we don't care if LACKMEM becomes
 * inaccurate.)
 */
tapeSpace = (int64) maxTapes * TAPE_BUFFER_OVERHEAD;
if (tapeSpace + GetMemoryChunkSpace(state->memtuples) < state->allowedMem)
    USEMEM(state, tapeSpace);

Remember, this is after the final grow_memtuples() call that uses your intelligent resizing logic [2], so we'll USEMEM() in a way that effectively makes some non-trivial proportion of our optimal memtuples sizing unusable. Again, that could be really bad for cases like yours, with very little memory relative to data volume.

Thanks

[1] Commit df700e6b4
[2] Commit 8ae35e918

-- Peter Geoghegan

diff --git a/src/backend/utils/sort/tuplesort.c b/src/backend/utils/sort/tuplesort.c
index cf1cdcb..bd8c0ee 100644
--- a/src/backend/utils/sort/tuplesort.c
+++ b/src/backend/utils/sort/tuplesort.c
@@ -201,6 +201,7 @@ typedef enum
  * tape during a preread cycle (see discussion at top of file).
  */
 #define MINORDER 6 /* minimum merge order */
+#define MAXORDER 250 /* maximum merge order */
 #define TAPE_BUFFER_OVERHEAD (BLCKSZ * 3)
 #define MERGE_BUFFER_SIZE (BLCKSZ * 32)
 
@@ -2066,8 +2067,15 @@ tuplesort_merge_order(int64 allowedMem)
 	mOrder = (allowedMem - TAPE_BUFFER_OVERHEAD) /
 		(MERGE_BUFFER_SIZE + TAPE_BUFFER_OVERHEAD);
 
-	/* Even in minimum memory, use at least a MINORDER merge */
+	/*
+	 * Even in minimum memory, use at least a MINORDER merge. At the same
+	 * time, cap the maximum merge order quasi-arbitrarily. With more than
+	 * several hundred tapes, the overhead per-tape is likely to become a
+	 * concern, since that will only happen in the event of severe memory
+	 * pressure.
+	 */
 	mOrder = Max(mOrder, MINORDER);
+	mOrder = Min(mOrder, MAXORDER);
 
 	return mOrder;
 }
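In pure-arithmetic terms, the patched tuplesort_merge_order() amounts to the following (a Python transliteration for illustration only; the constants mirror the diff, with BLCKSZ assumed to be the default 8192):

```python
BLCKSZ = 8192
TAPE_BUFFER_OVERHEAD = BLCKSZ * 3    # 3 blocks of buffer space charged per tape
MERGE_BUFFER_SIZE = BLCKSZ * 32
MINORDER, MAXORDER = 6, 250

def tuplesort_merge_order(allowed_mem):
    # Each input tape needs a merge buffer plus its tape buffer overhead
    m = (allowed_mem - TAPE_BUFFER_OVERHEAD) // (MERGE_BUFFER_SIZE + TAPE_BUFFER_OVERHEAD)
    # Clamp between MINORDER and the new MAXORDER cap
    return min(max(m, MINORDER), MAXORDER)
```

At 64MB of work_mem this still returns 233 (234 tapes including the output tape), which is why the MAXORDER = 250 cap by itself does nothing for the 64MB case discussed in this thread; lowering MAXORDER to 100 is what would change it.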
Re: [HACKERS] Using quicksort for every external sort run
On Thu, Dec 24, 2015 at 8:44 AM, Peter Geoghegan wrote: > [long blahblah] (Patch moved to next CF; work is ongoing. Thanks to everyone here for staying active.) -- Michael
Re: [HACKERS] Using quicksort for every external sort run
On Wed, Dec 23, 2015 at 1:16 PM, Robert Haas wrote: > On Wed, Dec 23, 2015 at 3:31 PM, Peter Geoghegan wrote: >> On Wed, Dec 23, 2015 at 9:37 AM, Robert Haas wrote: >>> The point is, nobody can tell WHAT effects this is modeling. >>> Increasing the tuple size makes the crossover go up. Or down. >> >> There are multiple, competing considerations. > > Please explain what they are and how they lead you to believe that the > cost factors you have chosen are good ones. Alright. I've gone on at length about how I'm blurring the distinction between internal and external sorting, or about how modern hardware characteristics allow that. There are several reasons for that. Now, we all know that main memory sizes have increased dramatically since the 1970s, and storage characteristics are very different, and that CPU caching effects have become very important, and that everyone has lots more data. There is one thing that hasn't really become bigger in all that time, though: the width of tuples. So, as I go into in comments within useselection(), that's the main reason why avoiding I/O isn't all that impressive, especially at the high end. It's just not that big of a cost at the high end. Beyond that, as linear costs go, palloc() is a much bigger concern to me at this point. I think we can waste a lot less time by amortizing that more extensively (to say nothing of the saving in memory). This is really obvious by just looking at trace_sort output with my patch applied when dealing with many runs, sorting millions of tuples: There just isn't that much time spent on I/O at all, and it's well hidden by foreground processing that is CPU bound. With smaller work_mem sizes and far fewer tuples, a case much more common within sort nodes (as opposed to utility statements), this is less true. Sorting 1,000 or 10,000 tuples is an entirely different thing to sorting 1,000,000 tuples. 
So, first of all, the main consideration is that saving I/O turns out to not matter that much at the high end. That's why we get very conservative past the fairly arbitrary MaxAllocSize memtuples threshold (which has a linear relationship to the number of tuples -- *not* the amount of memory used or disk space that may be used).

A second consideration is how much I/O we can save -- one would hope it would be a lot, certainly the majority, to make up for the downside of using a cache inefficient technique. That is a different thing to the number of memtuples. If you had really huge tuples, there would be a really big saving in I/O, often without a corresponding degradation in cache performance (since there still may not be that many memtuples, which is more the problem for the heap than anything else). This distinction is especially likely to matter for the CLUSTER case, where wide heap tuples (including heap tuple headers, visibility info) are kind of along for the ride, which is less true elsewhere, particularly for the CREATE INDEX case.

The cache inefficiency of spilling incrementally from a heap isn't so bad if we only end up sorting a small number of tuples that way. So as the number of tuples that we end up actually sorting that way increases, the cache inefficiency becomes worse, while at the same time, we save less I/O. The former is a bigger problem than the latter, by a wide margin, I believe. This code is an attempt to credit cases with really wide tuples:

/*
 * Starting from a threshold of 90%, refund 7.5% per 32 byte
 * average-size-increment.
 */
increments = MAXALIGN_DOWN((int) avgTupleSize) / 32;
crossover = 0.90 - (increments * 0.075);

Most cases won't get too many "increments" of credit (although CLUSTER sorts will probably get relatively many). A third consideration is that we should be stingy about giving too much credit to wider tuples because the cache inefficiency hurts more as we achieve mere linear savings in I/O.
So, most of the savings off a 99.99% theoretical baseline threshold are fixed (you usually save 9.99% off that up-front). A fourth consideration is that the heap seems to do really badly past 1GB in general, due to cache characteristics. This is certainly not something that I know how to model well. I don't blame you for calling this voodoo, because to some extent it is. But I remind you that the consequences of making the wrong decision here are still better than the status quo today -- probably far better, overall. I also remind you that voodoo code is something you'll find in well regarded code bases at times. Have you ever written networking code? Packet switching is based on some handwavy observations about the real world. Practical implementations often contain voodoo magic numbers. So, to answer your earlier question: Yes, maybe it wouldn't be so bad, all things considered, to let someone complain about this if they have a real-world problem with it. The complexity of what we're talking about makes me modest about my ability to get it exactly right. At the same time, t
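Putting the considerations above together, the shape of the model can be restated as a hypothetical sketch. The constants are the ones quoted in this thread; the MaxAllocSize threshold is expressed as a tuple count assuming a 24-byte SortTuple, and the direction of the over-threshold 90%/95% branch is inferred from the 44,739,243-tuple example debated elsewhere in the thread. None of this is authoritative code:

```python
MAX_ALLOC_TUPLES = 0x3FFFFFFF // 24  # MaxAllocSize / assumed sizeof(SortTuple)

def crossover(n_tuples, avg_tuple_size):
    """Estimated input fraction past which replacement selection stops paying."""
    if n_tuples > MAX_ALLOC_TUPLES:
        # Past a MaxAllocSize-sized SortTuple array, be very conservative
        return 0.95 if avg_tuple_size > 32 else 0.90
    # Start from 90%, refund 7.5% per 32-byte average-size increment
    # (MAXALIGN_DOWN assumed to round down to 8 bytes), clamped to [40%, 85%]
    increments = (avg_tuple_size // 8 * 8) // 32
    return max(0.40, min(0.85, 0.90 - increments * 0.075))
```

This reproduces the discontinuity complained about: 44,739,242 224-byte tuples land on the 40% floor, while one more tuple jumps straight to 95%.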
Re: [HACKERS] Using quicksort for every external sort run
On Wed, Dec 23, 2015 at 3:31 PM, Peter Geoghegan wrote: > On Wed, Dec 23, 2015 at 9:37 AM, Robert Haas wrote: >> The point is, nobody can tell WHAT effects this is modeling. >> Increasing the tuple size makes the crossover go up. Or down. > > There are multiple, competing considerations. Please explain what they are and how they lead you to believe that the cost factors you have chosen are good ones. My point here is: even if I were to concede that your cost model yields perfect answers in every case, the patch needs to give at least some hint as to why. Right now, it really doesn't. >>> Another factor is that the heap could be useful for other stuff in the >>> future. As Simon Riggs pointed out, for deduplicating values as >>> they're read in by tuplesort. (Okay, that's really the only other >>> thing, but it's a good one). >> >> Not sure how that would work? > > Tuplesort would have license to discard tuples with matching existing > values, because the caller gave it permission to. This is something > that you can easily imagine occurring with ordered set aggregates, for > example. It would work in a way not unlike a top-N heapsort does > today. This would work well when it can substantially lower the use of > memory (initially heapification when the threshold is crossed would > probably measure the number of duplicates, and proceed only when it > looked like a promising strategy). It's not clear to me how having a heap helps with that. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Using quicksort for every external sort run
On Wed, Dec 23, 2015 at 1:03 PM, Jeff Janes wrote: > If we do think it is important to almost never cause regressions at > the default maintenance_work_mem (I am agnostic on the importance of > that), then I think we have more work to do here. I just don't know > what that work is. My next revision will use grow_memtuples() in advance of the final on-the-fly merge step, in a way that considers that we won't be losing out to palloc() overhead (so it'll mostly be the memory patch that is revised). This can make a large difference to the number of slots (memtuples) available. I think I measured a 6% or 7% additional improvement for a case with a fairly small number of runs to merge. It might help significantly more when there are more runs to merge. -- Peter Geoghegan
Re: [HACKERS] Using quicksort for every external sort run
On Mon, Dec 14, 2015 at 7:22 PM, Peter Geoghegan wrote: > On Mon, Dec 14, 2015 at 6:58 PM, Greg Stark wrote: >> I ran sorts with various parameters on my small NAS server. > > ... > >> without the extra memory optimizations. > > Thanks for taking the time to benchmark the patch! > > While I think it's perfectly fair that you didn't apply the final > on-the-fly merge "memory pool" patch, I also think that it's quite > possible that the regression you see at the very low end would be > significantly ameliorated or even eliminated by applying that patch, > too. After all, Jeff Janes had a much harder time finding a > regression, probably because he benchmarked all patches together. The regression I found when building an index on a column of 400,000,000 md5(random()::text) with 64MB maintenance_work_mem was not hard to find at all. I still don't understand what is going on with it, but it is reproducible. Perhaps it is very unlikely and I just got very lucky in finding it immediately after switching to that data-type for my tests, but I wouldn't assume that on current evidence. If we do think it is important to almost never cause regressions at the default maintenance_work_mem (I am agnostic on the importance of that), then I think we have more work to do here. I just don't know what that work is. Cheers, Jeff
Re: [HACKERS] Using quicksort for every external sort run
On Wed, Dec 23, 2015 at 9:37 AM, Robert Haas wrote: > The point is, nobody can tell WHAT effects this is modeling. > Increasing the tuple size makes the crossover go up. Or down. There are multiple, competing considerations. > This analogy is faulty. It's true that when we run across a qual > whose selectivity we cannot estimate in any meaningful way, we have to > just take a stab in the dark and hope for the best. Similarly, if we > have no information about what the crossover point for a given sort > is, we'd have to take some arbitrary estimate, like 75%, and hope for > the best. But in this case, we DO have information. We have an > estimated row count and an estimated row width. And those values are > not being ignored, they are getting used. The problem is that they > are being used in an arbitrary way that is not justified by any chain > of reasoning. There is a chain of reasoning. It's not particularly satisfactory that it's so fuzzy, certainly, but the competing considerations here are substantive (and include erring towards not proceeding with replacement selection/"quicksort with spillover" when the benefits are low relative to the costs, which, to repeat myself, is itself novel). I am more than open to suggestions on alternatives. As I said, I don't particularly care for my current approach, either. But doing something analogous to cost_sort() for our private "Do we quicksort with spillover?"/useselection() model is going to be strictly worse than what I have proposed. Any cost model will have to be sensitive to different types of CPU costs at the level that matters here -- such as the size of the heap, and its cache efficiency. That's really important, but very complicated, and variable enough that erring against using replacement selection seems like a good idea with bigger heaps especially. 
That (cache efficiency) is theoretically the only difference that matters here (other than I/O, of course, but avoiding I/O is only the upside of proceeding, and if we only weigh that then the cost model always gives the same answer). Perhaps you can suggest an alternative model that weighs these factors. Most sorts are less than 1GB, and it seems worthwhile to avoid I/O at the level where an internal sort is just out of reach. Really big CREATE INDEX sorts are not really what I have in mind with "quicksort with spillover". This cost_sort() code seems pretty bogus to me, FWIW:

/* Assume 3/4ths of accesses are sequential, 1/4th are not */
startup_cost += npageaccesses *
    (seq_page_cost * 0.75 + random_page_cost * 0.25);

I think we can afford to be a lot more optimistic about the proportion of sequential accesses. >> Merging is still sorting. The 10 tuples are not very cheap to merge >> against the 1000 tuples, because you'll probably still end up reading >> most of the 1000 tuples to do so. > > You're going to read all of the 1000 tuples no matter what, because > you need to return them, but you will also need to make comparisons on > most of them, unless the data distribution is favorable. Assuming no > special good luck, it'll take something close to X + Y - 1 comparisons > to do the merge, so something around 1009 comparisons here. > Maintaining the heap property is not free either, but it might be > cheaper. I'm pretty sure that it's cheaper. Some of the really good cases for "quicksort with spillover" were only a little bit slower than a fully internal sort when the work_mem threshold was just crossed. >> Another factor is that the heap could be useful for other stuff in the >> future. As Simon Riggs pointed out, for deduplicating values as >> they're read in by tuplesort. (Okay, that's really the only other >> thing, but it's a good one). > > Not sure how that would work?
Tuplesort would have license to discard tuples with matching existing values, because the caller gave it permission to. This is something that you can easily imagine occurring with ordered set aggregates, for example. It would work in a way not unlike a top-N heapsort does today. This would work well when it can substantially lower the use of memory (initial heapification when the threshold is crossed would probably measure the number of duplicates, and proceed only when it looked like a promising strategy). By the way, I think the heap currently does quite badly with many duplicated values. That case seemed significantly slower than a similar case with high cardinality tuples. -- Peter Geoghegan
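The duplicate-discarding idea above reads roughly like a top-N heapsort with a uniqueness check bolted on. A toy sketch (hypothetical -- this is not the tuplesort API, only the "license to discard duplicates" behavior being described):

```python
import heapq

def push_distinct(heap, seen, value):
    # Insert value unless an equal value was already accepted; the caller
    # has given "license" to silently drop duplicates.
    if value in seen:
        return False
    heapq.heappush(heap, value)
    seen.add(value)
    return True

heap, seen = [], set()
for v in [3, 1, 3, 2, 1]:
    push_distinct(heap, seen, v)

ordered = [heapq.heappop(heap) for _ in range(len(heap))]
print(ordered)  # [1, 2, 3]
```

With heavily duplicated input, memory use is bounded by the number of distinct values rather than the number of input tuples, which is where the claimed saving would come from.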
Re: [HACKERS] Using quicksort for every external sort run
On Tue, Dec 22, 2015 at 8:10 PM, Peter Geoghegan wrote: >> Sure, there are arbitrary numbers all over the code, driven by >> empirical observations about what factors are important to model. But >> this is not that. You don't have a thing called seq_page_cost and a >> thing called cpu_tuple_cost and then say, well, empirically the ratio >> is about 100:1, so let's make the former 1 and the latter 0.01. You >> just have some numbers, and it's not clear what, if anything, they >> actually represent. > > What I find difficult to accept about what you say here is that at > *this* level, something like cost_sort() has little to recommend it. > It costs a sort of a text attribute at the same level as the cost of > sorting the same tuples using an int4 attribute (based on the default > cpu_operator_cost for C functions -- without any attempt to > differentiate text and int4). > > Prior to 9.5, sorting text took about 5 - 10 times longer than this > similar int4 sort. That's a pretty big difference, and yet I recall no > complaints. The cost of a comparison in a sort can hardly be > considered in isolation, anyway -- cache efficiency is at least as > important. > > Of course, the point is that the goal of a cost model is not to > simulate reality as closely as possible -- it's to produce a good > outcome for performance purposes under realistic assumptions. > Realistic assumptions include that you can't hope to account for > certain differences in cost. Avoiding a terrible outcome is very > important, but the worst case for useselection() is no worse than > today's behavior (or a lost opportunity to do better than today's > behavior). I agree with that. So, the question for any given cost model is: does it model the effects that matter? If you think that the cost of sorting integers vs. sorting text matters to the crossover point, then that should be modeled here. If it doesn't matter, then don't include it. The point is, nobody can tell WHAT effects this is modeling.
Increasing the tuple size makes the crossover go up. Or down. > Recently, the paper that was posted to the list about the Postgres > optimizer stated formally what I know I had a good intuitive sense of > for a long time: that better selectivity estimates are much more > important than better cost models in practice. The "empirical > observations" driving something like DEFAULT_EQ_SEL are very weak -- > but what are you gonna do? This analogy is faulty. It's true that when we run across a qual whose selectivity we cannot estimate in any meaningful way, we have to just take a stab in the dark and hope for the best. Similarly, if we have no information about what the crossover point for a given sort is, we'd have to take some arbitrary estimate, like 75%, and hope for the best. But in this case, we DO have information. We have an estimated row count and an estimated row width. And those values are not being ignored, they are getting used. The problem is that they are being used in an arbitrary way that is not justified by any chain of reasoning. > But yes, let me concede more clearly: the cost model is based on > frobbing. But at least it's relatively honest about that, and is > relatively simple. I think it might be possible to make it simpler, > but I have a feeling that anything we can come up with will basically > have the same quality that you so dislike. I don't know how to do > better. Frankly, I'd rather be roughly correct than exactly wrong. Sure, but the fact that the model has huge discontinuities - perhaps most notably a case where adding a single tuple to the estimated cardinality changes the crossover point by a factor of two - suggests that you are probably wrong. The actual behavior does not change sharply when the size of the SortTuple array crosses 1GB, but the estimates do. That means that either the estimates are wrong for 44,739,242 tuples or they are wrong for 44,739,243 tuples. 
The behavior cannot be right in both cases unless that one extra tuple changes the behavior radically, or unless the estimate doesn't matter in the first place. > By the way, I think that there needs to be a little work done to > cost_sort() too, which so far I've avoided. Yeah, I agree, but that can be a separate topic. >> I agree, but that's not what I proposed. You don't want to keep >> re-sorting to incorporate new tuples into the run, but if you've got >> 1010 tuples and you can fit 1000 tuples in, you can (a) quicksort the >> first 1000 tuples, (b) read in 10 more tuples, dumping the first 10 >> tuples from run 0 to disk, (c) quicksort the last 10 tuples to create >> run 1, and then (d) merge run 0 [which is mostly in memory] with run 1 >> [which is entirely in memory]. In other words, yes, quicksorting >> doesn't let you add things to the sort incrementally, but you can >> still write out the run incrementally, writing only as many tuples as >> you need to dump to get the rest of the input data into memory. > > Merging is
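For what it's worth, the 44,739,242 figure discussed above is exactly where a MaxAllocSize allocation of SortTuples runs out (assuming MaxAllocSize = 0x3FFFFFFF, i.e. 1GB - 1, and a 24-byte SortTuple on 64-bit builds), which is the discontinuity being debated:

```python
MAX_ALLOC_SIZE = 0x3FFFFFFF   # 1GB - 1; PostgreSQL's MaxAllocSize
SORT_TUPLE_BYTES = 24         # assumed sizeof(SortTuple) on a 64-bit build

threshold = MAX_ALLOC_SIZE // SORT_TUPLE_BYTES
print(threshold)              # 44739242 -- one more tuple flips the model's branch
```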
Re: [HACKERS] Using quicksort for every external sort run
On Tue, Dec 22, 2015 at 2:57 PM, Robert Haas wrote: > On Tue, Dec 22, 2015 at 4:37 PM, Peter Geoghegan wrote: >> That's not fair. DEFAULT_EQ_SEL, DEFAULT_RANGE_INEQ_SEL, and >> DEFAULT_NUM_DISTINCT are each about as arbitrary. We have to do >> something, though. >> > Sure, there are arbitrary numbers all over the code, driven by > empirical observations about what factors are important to model. But > this is not that. You don't have a thing called seq_page_cost and a > thing called cpu_tuple_cost and then say, well, empirically the ratio > is about 100:1, so let's make the former 1 and the latter 0.01. You > just have some numbers, and it's not clear what, if anything, they > actually represent. What I find difficult to accept about what you say here is that at *this* level, something like cost_sort() has little to recommend it. It costs a sort of a text attribute at the same level as the cost of sorting the same tuples using an int4 attribute (based on the default cpu_operator_cost for C functions -- without any attempt to differentiate text and int4). Prior to 9.5, sorting text took about 5 - 10 times longer than this similar int4 sort. That's a pretty big difference, and yet I recall no complaints. The cost of a comparison in a sort can hardly be considered in isolation, anyway -- cache efficiency is at least as important. Of course, the point is that the goal of a cost model is not to simulate reality as closely as possible -- it's to produce a good outcome for performance purposes under realistic assumptions. Realistic assumptions include that you can't hope to account for certain differences in cost. Avoiding a terrible outcome is very important, but the worst case for useselection() is no worse than today's behavior (or a lost opportunity to do better than today's behavior).
Recently, the paper that was posted to the list about the Postgres optimizer stated formally what I know I had a good intuitive sense of for a long time: that better selectivity estimates are much more important than better cost models in practice. The "empirical observations" driving something like DEFAULT_EQ_SEL are very weak -- but what are you gonna do? > The crossover point is clamped to a minimum of 40% [constant #1] and a > maximum of 85% [constant #2] when the size of the SortTuple array is > no more than MaxAllocSize. Between those bounds, the crossover point > is 90% [constant #3] minus 7.5% [constant #4] per 32-byte increment > [constant #5] of estimated average tuple size. On the other hand, > when the estimated average tuple size exceeds MaxAllocSize, the > crossover point is either 90% [constant #6] or 95% [constant #7] > depending on whether the average tuple size is greater than 32 bytes > [constant #8]. But if the row count hit is less than 50% [constant > #9] of the rows we've already seen, then we ignore it and do not use > selection. > > You make no attempt to justify why any of these numbers are correct, > or what underlying physical reality they represent. Just like selfuncs.h for the most part, then. > The comment which > describes the manner in which crossover point is computed for > SortTuple volumes under 1GB says "Starting from a threshold of 90%, > refund 7.5% per 32 byte average-size-increment." That is a precise > restatement of what the code does, but it doesn't attempt to explain > why it's a good idea. Perhaps the reader should infer that the > crossover point drops as the tuples get bigger, except that in the > over-1GB case, a larger tuple size causes the crossover point to go > *up* while in the under-1GB case, a larger tuple size causes the > crossover point to go *down*. Concretely, if we're sorting 44,739,242 > 224-byte tuples, the estimated crossover point is 40%. 
If we're > sorting 44,739,243 244-byte tuples, the estimated crossover point is > 95%. That's an extremely sharp discontinuity, and it seems very > unlikely that any real system behaves that way. Again, the goal of the cost model is not to model reality as such. This cost model is conservative about using replacement selection. It makes sense when you consider that there tend to be a lot fewer external sorts on a realistic workload -- if we can cut that number in half, which seems quite possible, that's pretty good, especially from a DBA's practical perspective. I want to buffer DBAs against suddenly incurring more I/O, but not at the risk of having a far longer sort for the first run, or at least only with minimal exposure to that risk. The cost model weighs the cost of the hint being wrong to some degree (which is indeed novel). I think it makes sense in light of the costs and benefits in this case, although I will add that I'm not entirely comfortable with it. I just don't imagine that there is a solution that I will be fully comfortable with. There may be one that superficially looks correct, but I see little point in that. > I'm prepared to concede that constant #9 - ignoring the input row > estimate if we've alrea
Re: [HACKERS] Using quicksort for every external sort run
On Tue, Dec 22, 2015 at 4:37 PM, Peter Geoghegan wrote: >> This looks like voodoo to me. I assume you tested it and maybe it >> gives correct answers, but it's got to be some kind of world record >> for number of arbitrary constants per SLOC, and there's no real >> justification for any of it. The comments say, essentially, well, we >> do this because it works. But suppose I try it on some new piece of >> hardware and it doesn't work well. What do I do? Email the author >> and ask him to tweak the arbitrary constants? > > That's not fair. DEFAULT_EQ_SEL, DEFAULT_RANGE_INEQ_SEL, and > DEFAULT_NUM_DISTINCT are each about as arbitrary. We have to do > something, though. > > MaxAllocHugeSize is used fairly arbitrarily in pg_stat_statements.c. > And that part (the MaxAllocSize part of my patch) only defines a point > after which we require a really favorable case for replacement > selection/quicksort with spillover to proceed. It's a safety valve. We > try to err on the side of not using replacement selection. Sure, there are arbitrary numbers all over the code, driven by empirical observations about what factors are important to model. But this is not that. You don't have a thing called seq_page_cost and a thing called cpu_tuple_cost and then say, well, empirically the ratio is about 100:1, so let's make the former 1 and the latter 0.01. You just have some numbers, and it's not clear what, if anything, they actually represent. In the space of 7 lines of code, you introduce 9 nameless constants: The crossover point is clamped to a minimum of 40% [constant #1] and a maximum of 85% [constant #2] when the size of the SortTuple array is no more than MaxAllocSize. Between those bounds, the crossover point is 90% [constant #3] minus 7.5% [constant #4] per 32-byte increment [constant #5] of estimated average tuple size. 
On the other hand, when the estimated average tuple size exceeds MaxAllocSize, the crossover point is either 90% [constant #6] or 95% [constant #7] depending on whether the average tuple size is greater than 32 bytes [constant #8]. But if the row count hit is less than 50% [constant #9] of the rows we've already seen, then we ignore it and do not use selection. You make no attempt to justify why any of these numbers are correct, or what underlying physical reality they represent. The comment which describes the manner in which crossover point is computed for SortTuple volumes under 1GB says "Starting from a threshold of 90%, refund 7.5% per 32 byte average-size-increment." That is a precise restatement of what the code does, but it doesn't attempt to explain why it's a good idea. Perhaps the reader should infer that the crossover point drops as the tuples get bigger, except that in the over-1GB case, a larger tuple size causes the crossover point to go *up* while in the under-1GB case, a larger tuple size causes the crossover point to go *down*. Concretely, if we're sorting 44,739,242 224-byte tuples, the estimated crossover point is 40%. If we're sorting 44,739,243 244-byte tuples, the estimated crossover point is 95%. That's an extremely sharp discontinuity, and it seems very unlikely that any real system behaves that way. I'm prepared to concede that constant #9 - ignoring the input row estimate if we've already seen twice that many rows - probably doesn't need a whole lot of justification here, and what justification it does need is provided by the fact that (we think) replacement selection only wins when there are going to be less than 2 quicksorted runs. But the other 8 constants here have to have reasons why they exist, what they represent, and why they have the values they do, and that explanation needs to be something that can be understood by people besides you. The overall cost model needs some explanation of the theory of operation, too. 
In my opinion, reasoning in terms of a crossover point is a strange way of approaching the problem. What would be more typical, at least in our code, and I suspect in general, is to do a cost estimate of using selection and a cost estimate of not using selection and compare them. Replacement selection has a CPU cost and an I/O cost, each of which is estimable based on the tuple count, chosen comparator, and expected I/O volume. Quicksort has those same costs, in different amounts. If those respective costs are accurately estimated, then you can pick the strategy with the lower cost and expect to win. >> I wonder if there's a way to accomplish what you're trying to do here >> that avoids the need to have a cost model at all. As I understand it, >> and please correct me wherever I go off the rails, the situation is: >> >> 1. If we're sorting a large amount of data, such that we can't fit it >> all in memory, we will need to produce a number of sorted runs and >> then merge those runs. If we generate each run using a heap with >> replacement selection, rather than quicksort, we will produce runs >> that are, on the average, about twi
Re: [HACKERS] Using quicksort for every external sort run
On Tue, Dec 22, 2015 at 9:10 AM, Robert Haas wrote: > So I was looking at the 0001 patch Thanks. I'm going to produce a revision of 0002 shortly, so perhaps hold off on that one. The big change there will be to call grow_memtuples() to allow us to increase the number of slots without palloc() overhead spuriously being weighed (since the memory for the final on-the-fly merge phase doesn't have palloc() overhead). Also, will incorporate what Jeff and you wanted around terminology. > This looks like voodoo to me. I assume you tested it and maybe it > gives correct answers, but it's got to be some kind of world record > for number of arbitrary constants per SLOC, and there's no real > justification for any of it. The comments say, essentially, well, we > do this because it works. But suppose I try it on some new piece of > hardware and it doesn't work well. What do I do? Email the author > and ask him to tweak the arbitrary constants? That's not fair. DEFAULT_EQ_SEL, DEFAULT_RANGE_INEQ_SEL, and DEFAULT_NUM_DISTINCT are each about as arbitrary. We have to do something, though. MaxAllocHugeSize is used fairly arbitrarily in pg_stat_statements.c. And that part (the MaxAllocSize part of my patch) only defines a point after which we require a really favorable case for replacement selection/quicksort with spillover to proceed. It's a safety valve. We try to err on the side of not using replacement selection. > I wonder if there's a way to accomplish what you're trying to do here > that avoids the need to have a cost model at all. As I understand it, > and please correct me wherever I go off the rails, the situation is: > > 1. If we're sorting a large amount of data, such that we can't fit it > all in memory, we will need to produce a number of sorted runs and > then merge those runs. 
If we generate each run using a heap with > replacement selection, rather than quicksort, we will produce runs > that are, on the average, about twice as long, which means that we > will have fewer runs to merge at the end. > > 2. Replacement selection is slower than quicksort on a per-tuple > basis. Furthermore, merging more runs isn't necessarily any slower > than merging fewer runs. Therefore, building runs via replacement > selection tends to lose even though it tends to reduce the number of > runs to merge. Even when having a larger number of runs results in an > increase in the number of merge passes, we save so much time building the > runs that we often (maybe not always) still come out ahead. I'm with you so far. I'll only add: doing multiple passes ought to be very rare anyway. > 3. However, when replacement selection would result in a single run, > and quicksort results in multiple runs, using quicksort loses. This > is especially true when the amount of data we have is between one > and two times work_mem. If we fit everything into one run, we do not > need to write any data to tape, but if we overflow by even a single > tuple, we have to write a lot of data to tape. No, this is where you lose me. I think that it's basically not true that replacement selection can ever be faster than quicksort, even in the cases where the conventional wisdom would have you believe so (e.g. what you say here). Unless you have very little memory relative to data size, or something along those lines. The conventional wisdom obviously needs some revision, but it was perfectly correct in the 1970s and 1980s. However, where replacement selection can still help is avoiding I/O *entirely*. If we can avoid spilling 95% of tuples in the first place, and quicksort the remaining (heapified) tuples that were not spilled, and merge an in-memory run with an on-tape run, then we can win big. Quicksort is not amenable to incremental spilling at all. 
I call this "quicksort with spillover" (it is a secondary optimization that the patch adds). This shows up in EXPLAIN ANALYZE, and avoids a stark discontinuity in the cost function of sorts. That could really help with admission control, and simplifying the optimizer, making merge joins less scary. So with the patch, "quicksort with spillover" and "replacement selection" are almost synonymous, except that we acknowledge the historic importance of replacement selection to some degree. The patch completely discards the conventional use of replacement selection -- it just preserves its priority queue (heap) implementation where incrementalism is thought to be particularly useful (avoiding I/O entirely). But this comparison has nothing to do with comparing the master branch with my patch, since the master branch never attempts to avoid I/O having committed to an external sort. It uses replacement selection in a way that is consistent with the conventional wisdom, wisdom which has now been shown to be obsolete. BTW, I think that abandoning incrementalism (replacement selection) will have future benefits for memory management. I bet we can get away with one big palloc() for second or subsequent
Re: [HACKERS] Using quicksort for every external sort run
On Sun, Dec 6, 2015 at 7:25 PM, Peter Geoghegan wrote: > On Tue, Nov 24, 2015 at 4:33 PM, Peter Geoghegan wrote: >> So, the bottom line is: This patch seems very good, is unlikely to >> have any notable downside (no case has been shown to be regressed), >> but has yet to receive code review. I am working on a new version with >> the first two commits consolidated, and better comments, but that will >> have the same code, unless I find bugs or am dissatisfied. It mostly >> needs thorough code review, and to a lesser extent some more >> performance testing. > > I'm currently spending a lot of time working on parallel CREATE INDEX. > I should not delay posting a new version of my patch series any > further, though. I hope to polish up parallel CREATE INDEX to be able > to show people something in a couple of weeks. > > This version features consolidated commits, the removal of the > multipass_warning parameter, and improved comments and commit > messages. It has almost entirely unchanged functionality. > > The only functional changes are: > > * The function useselection() is taught to distrust an obviously bogus > caller reltuples hint (when it's already less than half of what we > know to be the minimum number of tuples that the sort must sort, > immediately after LACKMEM() first becomes true -- this is probably a > generic estimate). > > * Prefetching only occurs when writing tuples. Explicit prefetching > appears to hurt in some cases, as David Rowley has shown over on the > dedicated thread. But it might still be that writing tuples is a case > that is simple enough to benefit consistently, due to the relatively > uniform processing that memory latency can hide behind for that case > (before, the same prefetching instructions were used for CREATE INDEX > and for aggregates, for example). > > Maybe we should consider trying to get patch 0002 (the memory > pool/merge patch) committed first, something Greg Stark suggested > privately. 
> That might actually be an easier way of integrating this
> work, since it changes nothing about the algorithm we use for merging
> (it only improves memory locality), and so is really an independent
> piece of work (albeit one that makes a huge overall difference due to
> the other patches increasing the time spent merging in absolute terms,
> and especially as a proportion of the total).

So I was looking at the 0001 patch and came across this code:

+        /*
+         * Crossover point is somewhere between where memtuples is between 40%
+         * and all-but-one of total tuples to sort.  This weighs approximate
+         * savings in I/O, against generic heap sorting cost.
+         */
+        avgTupleSize = (double) memNowUsed / (double) state->memtupsize;
+
+        /*
+         * Starting from a threshold of 90%, refund 7.5% per 32 byte
+         * average-size-increment.
+         */
+        increments = MAXALIGN_DOWN((int) avgTupleSize) / 32;
+        crossover = 0.90 - (increments * 0.075);
+
+        /*
+         * Clamp, making either outcome possible regardless of average size.
+         *
+         * 40% is about the minimum point at which "quicksort with spillover"
+         * can still occur without a logical/physical correlation.
+         */
+        crossover = Max(0.40, Min(crossover, 0.85));
+
+        /*
+         * The point where the overhead of maintaining the heap invariant is
+         * likely to dominate over any saving in I/O is somewhat arbitrarily
+         * assumed to be the point where memtuples' size exceeds MaxAllocSize
+         * (note that overall memory consumption may be far greater).  Past
+         * this point, only the most compelling cases use replacement selection
+         * for their first run.
+         *
+         * This is not about cache characteristics so much as the O(n log n)
+         * cost of sorting larger runs dominating over the O(n) cost of
+         * writing/reading tuples.
+         */
+        if (sizeof(SortTuple) * state->memtupcount > MaxAllocSize)
+            crossover = avgTupleSize > 32 ? 0.90 : 0.95;

This looks like voodoo to me. 
I assume you tested it and maybe it gives correct answers, but it's got to be some kind of world record for number of arbitrary constants per SLOC, and there's no real justification for any of it. The comments say, essentially, well, we do this because it works. But suppose I try it on some new piece of hardware and it doesn't work well. What do I do? Email the author and ask him to tweak the arbitrary constants? The dependency on MaxAllocSize seems utterly bizarre to me. If we decide to modify our TOAST infrastructure so that we support datums up to 2GB in size, or alternatively datums of up to only 512MB in size, do you expect that to change the behavior of tuplesort.c? I bet not, but that's a major reason why MaxAllocSize is defined the way it is. I wonder if there's a way to accomplish what you're trying to do here that avoids the need to have a cost model at all. As I understand it, and please correct me wherever I go off the rails, the situation is: 1. If we're sorting a large amoun
Re: [HACKERS] Using quicksort for every external sort run
On Fri, Dec 18, 2015 at 12:50 PM, Robert Haas wrote: >> BTW, I'm not necessarily determined to make the new special-purpose >> allocator work exactly as proposed. It seemed useful to prioritize >> simplicity, and so currently there is one big "huge palloc()" with >> which we blow our memory budget, and that's it. However, I could >> probably be more clever about "freeing ranges" initially preserved for >> a now-exhausted tape. That kind of thing. > > What about the case where we think that there will be a lot of data > and have a lot of work_mem available, but then the user sends us 4 > rows because of some mis-estimation? The memory patch only changes the final on-the-fly merge phase. There is no estimate involved there. I continue to use whatever "slots" (memtuples) are available for the final on-the-fly merge. However, I allocate all remaining memory that I have budget for at once. My remarks about the efficient use of that memory were only really about each tape's use of its part of that over time. Again, to emphasize, this is only for the final on-the-fly merge phase. >> With the on-the-fly merge memory patch, I'm improving locality of >> access (for each "tuple proper"/"tuple itself"). If I also happen to >> improve the situation around palloc() fragmentation at the same time, >> then so much the better, but that's clearly secondary. > > I don't really understand this comment. I just mean that I wrote the memory patch with memory locality in mind, not palloc() fragmentation or other overhead. -- Peter Geoghegan -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Using quicksort for every external sort run
On Fri, Dec 18, 2015 at 2:57 PM, Peter Geoghegan wrote: > On Fri, Dec 18, 2015 at 10:12 AM, Robert Haas wrote: >> I don't really like the term "memory pool" either. We're growing a >> bunch of little special-purpose allocators all over the code base >> because of palloc's somewhat dubious performance and memory usage >> characteristics, but if any of those are referred to as memory pools >> it has thus far escaped my notice. > > BTW, I'm not necessarily determined to make the new special-purpose > allocator work exactly as proposed. It seemed useful to prioritize > simplicity, and so currently there is one big "huge palloc()" with > which we blow our memory budget, and that's it. However, I could > probably be more clever about "freeing ranges" initially preserved for > a now-exhausted tape. That kind of thing. What about the case where we think that there will be a lot of data and have a lot of work_mem available, but then the user sends us 4 rows because of some mis-estimation? > With the on-the-fly merge memory patch, I'm improving locality of > access (for each "tuple proper"/"tuple itself"). If I also happen to > improve the situation around palloc() fragmentation at the same time, > then so much the better, but that's clearly secondary. I don't really understand this comment. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Re: [HACKERS] Using quicksort for every external sort run
On Fri, Dec 18, 2015 at 10:12 AM, Robert Haas wrote: > I don't really like the term "memory pool" either. We're growing a > bunch of little special-purpose allocators all over the code base > because of palloc's somewhat dubious performance and memory usage > characteristics, but if any of those are referred to as memory pools > it has thus far escaped my notice. BTW, I'm not necessarily determined to make the new special-purpose allocator work exactly as proposed. It seemed useful to prioritize simplicity, and so currently there is one big "huge palloc()" with which we blow our memory budget, and that's it. However, I could probably be more clever about "freeing ranges" initially preserved for a now-exhausted tape. That kind of thing. With the on-the-fly merge memory patch, I'm improving locality of access (for each "tuple proper"/"tuple itself"). If I also happen to improve the situation around palloc() fragmentation at the same time, then so much the better, but that's clearly secondary. -- Peter Geoghegan
Re: [HACKERS] Using quicksort for every external sort run
On Fri, Dec 18, 2015 at 10:12 AM, Robert Haas wrote: > Anyway, I agree with Jeff that this terminology shouldn't creep into > function and structure member names. Okay. > I don't really like the term "memory pool" either. We're growing a > bunch of little special-purpose allocators all over the code base > because of palloc's somewhat dubious performance and memory usage > characteristics, but if any of those are referred to as memory pools > it has thus far escaped my notice. It's a widely accepted term: https://en.wikipedia.org/wiki/Memory_pool But, sure, I'm not attached to it. -- Peter Geoghegan
Re: [HACKERS] Using quicksort for every external sort run
On Sat, Dec 12, 2015 at 5:28 PM, Peter Geoghegan wrote: > On Sat, Dec 12, 2015 at 12:10 AM, Jeff Janes wrote: >> I have a question about the terminology used in this patch. What is a >> tuple proper? What is it in contradistinction to? I would think that >> a tuple which is located in its own palloc'ed space is the "proper" >> one, leaving a tuple allocated in the bulk memory pool to be >> called...something else. I don't know what the >> non-judgmental-sounding antonym of postpositive "proper" is. > > "Tuple proper" is a term that appears 5 times in tuplesort.c today. As > it says at the top of that file: > > /* > * The objects we actually sort are SortTuple structs. These contain > * a pointer to the tuple proper (might be a MinimalTuple or IndexTuple), > * which is a separate palloc chunk --- we assume it is just one chunk and > * can be freed by a simple pfree(). SortTuples also contain the tuple's > * first key column in Datum/nullflag format, and an index integer. I see only three. In each case, "the tuple proper" could be replaced by "the tuple itself" or "the actual tuple" without changing the meaning, at least according to my understanding of the meaning. If that's causing confusion, perhaps we should just change the existing wording. Anyway, I agree with Jeff that this terminology shouldn't creep into function and structure member names. I don't really like the term "memory pool" either. We're growing a bunch of little special-purpose allocators all over the code base because of palloc's somewhat dubious performance and memory usage characteristics, but if any of those are referred to as memory pools it has thus far escaped my notice. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Re: [HACKERS] Using quicksort for every external sort run
On Mon, Dec 14, 2015 at 7:22 PM, Peter Geoghegan wrote: > Thanks for taking the time to benchmark the patch! Also, I should point out that you didn't add work_mem past the point where the master branch will get slower, while the patch continues to get faster. This seems to happen fairly reliably, certainly if work_mem is sized at about 1GB, and often at lower settings. With the POWER7 "Hydra" server, external sorting for a CREATE INDEX operation could put any possible maintenance_work_mem setting to good use -- my test case got faster with a 15GB maintenance_work_mem setting (the server has 64GB of ram). I think I tried 25GB as a maintenance_work_mem setting next, but started to get OOM errors at that point. Again, I point this out because I want to account for why my numbers were better (for the benefit of other people -- I think you get this, and are being fair). -- Peter Geoghegan
Re: [HACKERS] Using quicksort for every external sort run
On Mon, Dec 14, 2015 at 6:58 PM, Greg Stark wrote: > I ran sorts with various parameters on my small NAS server. ... > without the extra memory optimizations. Thanks for taking the time to benchmark the patch! While I think it's perfectly fair that you didn't apply the final on-the-fly merge "memory pool" patch, I also think that it's quite possible that the regression you see at the very low end would be significantly ameliorated or even eliminated by applying that patch, too. After all, Jeff Janes had a much harder time finding a regression, probably because he benchmarked all patches together. -- Peter Geoghegan
Re: [HACKERS] Using quicksort for every external sort run
I ran sorts with various parameters on my small NAS server. This is a fairly slow CPU and limited memory machine with lots of disk so I thought it would actually make a good test case for smaller servers.

The following is the speedup (for values < 100%) or slowdown (values > 100%) for the first patch only, the "quicksort all runs" patch, without the extra memory optimizations. At first glance it's a clear pattern that the extra runs do cause a slowdown whenever they cause more polyphase merges, which is bad news. But on further inspection, look just how low work_mem had to be to have a significant effect. Only the 4MB and 8MB work_mem cases were significantly impacted, and only when sorting over a GB of data (which was 2.7 - 7GB with the tuple overhead). The savings when work_mem was 64MB or 128MB were substantial.

Table Size  Sort Size  128MB  64MB  32MB  16MB  8MB   4MB
6914MB      2672 MB    64%    70%   93%   110%  133%  137%
3457MB      1336 MB    64%    67%   90%   92%   137%  120%
2765MB      1069 MB    68%    66%   84%   95%   111%  137%
1383MB      535 MB     66%    70%   72%   92%   99%   96%
691MB       267 MB     65%    69%   70%   86%   99%   98%
346MB       134 MB     65%    69%   73%   67%   90%   87%

The raw numbers in seconds. I've only run the test once so far on the NAS and there are some other things running on it so I really should rerun it a few more times at least.

HEAD:

Table Size  Sort Size  128MB    64MB    32MB     16MB     8MB      4MB
6914MB      2672 MB    1068.07  963.23  1041.94  1246.54  1654.35  2472.79
3457MB      1336 MB    529.34   482.3   450.77   555.76   657.34   1027.57
2765MB      1069 MB    404.02   394.36  348.31   414.48   507.38   657.17
1383MB      535 MB     196.48   194.26  173.48   182.57   214.42   258.05
691MB       267 MB     95.93    93.79   87.73    80.4     93.67    105.24
346MB       134 MB     45.6     44.24   42.39    44.22    46.17    49.85

With the quicksort patch:

Table Size  Sort Size  128MB   64MB    32MB    16MB    8MB     4MB
6914MB      2672 MB    683.6   679.0   969.4   1366.2  2193.6  3379.3
3457MB      1336 MB    339.1   325.1   404.9   509.8   902.2   1229.1
2765MB      1069 MB    275.3   260.1   292.4   395.4   561.9   898.7
1383MB      535 MB     129.9   136.4   124.6   167.5   213.2   247.1
691MB       267 MB     62.3    64.3    61.4    69.2    92.3    103.2
346MB       134 MB     29.8    30.7    30.9    29.4    41.6    43.4
Re: [HACKERS] Using quicksort for every external sort run
On Sun, Dec 13, 2015 at 7:31 PM, Jeff Janes wrote: > The reason for memtuples is to handle random access. Since we are no > longer doing random access, we no longer need it. > > We could free memtuples, re-allocate just enough to form the binary > heap for the N-way merge, and use all the rest of that space (which > could be a significant fraction of work_mem) as part of the new pool. Oh, you're talking about having the final on-the-fly merge use a tuplestore-style array of pointers to "tuple proper" memory (this was how tuplesort.c worked in all cases about 15 years ago, actually). I thought about that. It's not obvious how we'd do without SortTuple.tupindex during the merge phase, since it sometimes represents an offset into memtuples (the SortTuple array). See "free list" management within mergeonerun(). -- Peter Geoghegan
Re: [HACKERS] Using quicksort for every external sort run
On Sun, Dec 13, 2015 at 3:40 PM, Peter Geoghegan wrote: > On Sat, Dec 12, 2015 at 4:41 PM, Jeff Janes wrote: > Also, if I am reading this correctly, when we refill a pool from a logical tape we still transform each tuple as it is read from the disk format to the memory format. This inflates the size quite a bit, at least for single-datum tuples. If we instead just read the disk format directly into the pool, and converted them into the in-memory format when each tuple came due for the merge heap, would that destroy the locality of reference you are seeking to gain? >>> >>> Are you talking about alignment? >> >> Maybe alignment, but also the size of the SortTuple struct itself, >> which is not present on tape but is present in memory if I understand >> correctly. >> >> When reading 128kb (32 blocks) worth of in-memory pool, it seems like >> it only gets to read 16 to 18 blocks of tape to fill them up, in the >> case of building an index on single column 32-byte random md5 digests. >> I don't exactly know where all of that space goes, I'm taking an >> experimentalist approach. > > I'm confused. > > readtup_datum(), just like every other READTUP() variant, has the new > function tupproperalloc() as a drop-in replacement for the master > branch palloc() + USEMEM() calls. Right, I'm not comparing what your patch does to what the existing code does. I'm comparing it to what it could be doing. Only call READTUP when you need to go from the pool to the heap, not when you need to go from tape to the pool. If you store the data in the pool the same way they are stored on tape, then we no longer need memtuples at all. There is already a "mergetupcur" per tape pointing to the first tuple of the tape, and since they are now stored contiguously that is all that is needed, once you are done with one tuple the pointer is left pointing at the next one. The reason for memtuples is to handle random access. Since we are no longer doing random access, we no longer need it. 
We could free memtuples, re-allocate just enough to form the binary heap for the N-way merge, and use all the rest of that space (which could be a significant fraction of work_mem) as part of the new pool. Cheers, Jeff -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
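Jeff's suggestion above -- keep each tape's preloaded tuples in the pool in their on-tape format, and only build the in-memory form when a tuple is handed to the merge heap -- might be sketched like this. The length-prefixed layout and all names here are assumptions for illustration, not the patch's actual code:

```c
#include <stdint.h>
#include <string.h>

/*
 * Hypothetical cursor over one tape's pool partition, holding tuples
 * in their on-tape (length-prefixed) format.  No SortTuple is built
 * until a tuple actually enters the merge heap.
 */
typedef struct
{
    char   *cur;                /* next on-tape tuple in the pool */
    char   *end;                /* end of valid preloaded data */
} TapeCursor;

/*
 * Return the payload of the next tuple and advance the cursor, or
 * NULL when the partition is exhausted and must be refilled from tape.
 */
static char *
next_tape_tuple(TapeCursor *tc, uint32_t *len)
{
    char   *payload;

    if (tc->cur >= tc->end)
        return NULL;
    memcpy(len, tc->cur, sizeof(uint32_t));     /* unaligned-safe read of length word */
    payload = tc->cur + sizeof(uint32_t);
    tc->cur = payload + *len;                   /* cursor now points at the next tuple */
    return payload;
}
```

Since the tuples are consumed strictly in order during the merge, advancing one pointer per tape is all the bookkeeping needed, which is the point Jeff makes about no longer needing memtuples for random access.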
Re: [HACKERS] Using quicksort for every external sort run
On Sat, Dec 12, 2015 at 4:41 PM, Jeff Janes wrote: > Those usages make sense to me, as they are locally self-contained and > it is clear what they are in contradistinction to. But your usage is > spread throughout (even in function names, not just comments) and > seems to contradict the current usage as yours are not separately > palloced, as the "proper" ones described here are. I think that > "proper" only works when the same comment also defines the > alternative, rather than as some file-global description. Maybe > "pooltuple" rather than "tupleproper" I don't think of it that way. The "tuple proper" is the thing that the client passes to their tuplesort -- the thing they are actually interested in having sorted. Like an IndexTuple for CREATE INDEX callers, for example. SortTuple is just an internal implementation detail. (That appears all over the file tuplesort.c, just as my new references to "tuple proper" do. But neither appears elsewhere.) >>> Also, if I am reading this correctly, when we refill a pool from a >>> logical tape we still transform each tuple as it is read from the disk >>> format to the memory format. This inflates the size quite a bit, at >>> least for single-datum tuples. If we instead just read the disk >>> format directly into the pool, and converted them into the in-memory >>> format when each tuple came due for the merge heap, would that destroy >>> the locality of reference you are seeking to gain? >> >> Are you talking about alignment? >> >> Maybe alignment, but also the size of the SortTuple struct itself, >> which is not present on tape but is present in memory if I understand >> correctly. >> >> When reading 128kb (32 blocks) worth of in-memory pool, it seems like >> it only gets to read 16 to 18 blocks of tape to fill them up, in the >> case of building an index on single column 32-byte random md5 digests. >> I don't exactly know where all of that space goes, I'm taking an >> experimentalist approach. I'm confused. 
readtup_datum(), just like every other READTUP() variant, has the new function tupproperalloc() as a drop-in replacement for the master branch palloc() + USEMEM() calls. It is true that tupproperalloc() (and a couple of other places relating to preloading) know *a little* about the usage pattern -- tupproperalloc() accepts a "tape number" argument to know what partition within the large pool/buffer to use for each logical allocation. However, from the point of view of correctness, tupproperalloc() should function as a drop-in replacement for palloc() + USEMEM() calls in the context of the various READTUP() routines. I have done nothing special with any particular READTUP() routine, including readtup_datum() (all READTUP() routines have received the same treatment). Nothing else was changed in those routines, including how tuples are stored on tape. The datum case does kind of store the SortTuples on tape today in one very limited sense, which is that the length is stored fairly naively (that's already available from the IndexTuple in the case of writetup_index(), for example, but length must be stored explicitly for the datum case). My guess is your confusion comes from the fact that the memtuples array (the array of SortTuple) is also factored in to memory accounting, but that grows at geometric intervals, whereas the existing READTUP() retail palloc() calls (and their USEMEM() memory accounting calls) occur in drips and drabs. It's probably the case that the sizing of the memtuples array, and the amount of memory we use for that rather than for retail palloc()/"tuple proper" memory, is kind of arbitrary (why should the needs be the same when SortTuples are merge step "slots"?), but I don't think that's the biggest problem in this general area at all. -- Peter Geoghegan
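To make the per-tape partitioning concrete, an allocator of the kind tupproperalloc() is described as might look like this minimal sketch. The struct and function names are hypothetical, not the actual patch API:

```c
#include <stddef.h>

/*
 * Hypothetical view of one tape's partition within the large
 * tuple-proper pool.  Each tape bump-allocates within its own
 * partition, which is reset wholesale once the tape's preloaded
 * tuples have all entered the merge heap.
 */
typedef struct
{
    char   *base;               /* start of this tape's partition */
    size_t  size;               /* partition size in bytes */
    size_t  used;               /* bump-allocation offset */
} TapePool;

/* MAXALIGN-style rounding to 8 bytes */
#define ALIGN8(len) (((len) + 7) & ~((size_t) 7))

/* Drop-in stand-in for palloc() within one tape's partition */
static void *
pool_alloc(TapePool *pool, size_t len)
{
    size_t  alen = ALIGN8(len);
    char   *p;

    if (pool->used + alen > pool->size)
        return NULL;            /* caller must refill/reset first */
    p = pool->base + pool->used;
    pool->used += alen;
    return p;
}

/* Reset after all tuples preloaded from this tape are consumed */
static void
pool_reset(TapePool *pool)
{
    pool->used = 0;
}
```

The correctness-relevant property is the one Peter describes: from a READTUP() routine's point of view, pool_alloc() behaves just like palloc(), only the placement of the chunk differs.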
Re: [HACKERS] Using quicksort for every external sort run
On Tue, Dec 8, 2015 at 6:39 PM, Greg Stark wrote: > On Wed, Dec 9, 2015 at 12:02 AM, Jeff Janes wrote: >> >> >> Then in the next (final) merge, it has to read in this huge >> fragmented tape run emulation, generating a lot of random IO to read >> it. > > This seems fairly plausible. Logtape.c is basically implementing a > small filesystem and doesn't really make any attempt to avoid > fragmentation. The reason it does this is so that we can reuse blocks > and avoid needing to store 2x disk space for the temporary space. I > wonder, if we're no longer concerned about keeping the number of tapes > down, whether it makes sense to give up on this goal too and just write > out separate files for each tape, letting the filesystem avoid > fragmentation. I suspect it would also be better for filesystems like > ZFS and SSDs where rewriting blocks can be expensive. During my testing I actually ran into space problems, where the index I was building and the temp files used to do the sort for it could not coexist, and I was wondering if there wasn't a way to free up some of those temp files as the index was growing. So I don't think we want to throw caution to the wind here. (Also, I think it does make *some* attempt to reduce fragmentation, but it could probably do more.) > > >> With the patched code, the average length of reads on files in >> pgsql_tmp between lseeks or changing to a different file descriptor is >> 8, while in the unpatched code it is 14. > > I don't think Peter did anything to the scheduling of the merges so I > don't see how this would be different. It might just have hit a > preexisting case by changing the number and size of tapes. Correct. (There was a small additional increase with the memory pool, but it was small enough that I am not worried about it.) But this changing number and size of tapes was exactly what Robert was worried about, so I don't want to just dismiss it without further investigation. 
> > I also don't think the tapes really ought to be so unbalanced. I've > noticed some odd things myself -- like what does a 1-way merge mean > here? > > LOG: finished writing run 56 to tape 2 (9101313 blocks): CPU > 0.19s/10.97u sec elapsed 16.68 s I noticed some of those (although in my case they were always the first merges which were one-way) and I just attributed it to the fact that the algorithm doesn't know how many runs there will be up front, and so can't optimally distribute them among the tapes. But it does occur to me that we are taking the tape analogy rather too far in that case. We could say that we have only 223 tape *drives*, but that each run is a separate tape which can be remounted amongst the drives in any combination, as long as only 223 are active at one time. I started looking into this at one time, before I got sidetracked on the fact that the memory usage pattern would often leave a few bytes less than half of work_mem completely unused. Once that memory usage got fixed, I never returned to the original examination. And it would be a shame to sink more time into it now, when we are trying to avoid these polyphase merges altogether. So, is a sometimes-regression at 64MB really a blocker to substantial improvement most of the time at 64MB, and even more so at more realistic modern settings for large index building? > Fwiw attached are two patches for perusal. One is a trivial patch to > add the size of the tape to trace_sort output. I guess I'll just apply > that without discussion. +1 there. Having this in place would make evaluating the other things easier. Cheers, Jeff
Re: [HACKERS] Using quicksort for every external sort run
On Sat, Dec 12, 2015 at 2:28 PM, Peter Geoghegan wrote: > On Sat, Dec 12, 2015 at 12:10 AM, Jeff Janes wrote: >> I have a question about the terminology used in this patch. What is a >> tuple proper? What is it in contradistinction to? I would think that >> a tuple which is located in its own palloc'ed space is the "proper" >> one, leaving a tuple allocated in the bulk memory pool to be >> called...something else. I don't know what the >> non-judgmental-sounding antonym of postpositive "proper" is. > > "Tuple proper" is a term that appears 5 times in tuplesort.c today. As > it says at the top of that file: > > /* > * The objects we actually sort are SortTuple structs. These contain > * a pointer to the tuple proper (might be a MinimalTuple or IndexTuple), > * which is a separate palloc chunk --- we assume it is just one chunk and > * can be freed by a simple pfree(). SortTuples also contain the tuple's > * first key column in Datum/nullflag format, and an index integer. Those usages make sense to me, as they are locally self-contained and it is clear what they are in contradistinction to. But your usage is spread throughout (even in function names, not just comments) and seems to contradict the current usage as yours are not separately palloced, as the "proper" ones described here are. I think that "proper" only works when the same comment also defines the alternative, rather than as some file-global description. Maybe "pooltuple" rather than "tupleproper" > >> Also, if I am reading this correctly, when we refill a pool from a >> logical tape we still transform each tuple as it is read from the disk >> format to the memory format. This inflates the size quite a bit, at >> least for single-datum tuples. If we instead just read the disk >> format directly into the pool, and converted them into the in-memory >> format when each tuple came due for the merge heap, would that destroy >> the locality of reference you are seeking to gain? 
> > Are you talking about alignment? Maybe alignment, but also the size of the SortTuple struct itself, which is not present on tape but is present in memory if I understand correctly. When reading 128kb (32 blocks) worth of in-memory pool, it seems like it only gets to read 16 to 18 blocks of tape to fill them up, in the case of building an index on single column 32-byte random md5 digests. I don't exactly know where all of that space goes, I'm taking an experimentalist approach. Cheers, Jeff
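As a back-of-envelope check on Jeff's 32-blocks-in, 16-to-18-blocks-out observation (every overhead figure below is an assumption for illustration, not a measured value): on a 64-bit build each tuple in memory pays for a SortTuple slot plus allocator overhead that the on-tape form does not carry, which lands in roughly the right ballpark:

```c
#include <stddef.h>

/* Rough stand-in for tuplesort.c's SortTuple (LP64 layout assumed) */
typedef struct
{
    void       *tuple;          /* pointer to the tuple proper */
    long long   datum1;         /* stand-in for Datum */
    int         isnull1;        /* stand-in for bool, with padding */
    int         tupindex;
} SortTupleApprox;

/* Assumed in-memory cost per tuple: slot + palloc overhead + tuple proper */
static size_t
approx_mem_bytes(size_t datum_len)
{
    return sizeof(SortTupleApprox)      /* 24 bytes on LP64 */
        + 16                            /* assumed palloc chunk overhead */
        + datum_len + 8;                /* assumed in-memory tuple header */
}

/* On-tape cost: length word plus the datum itself */
static size_t
approx_tape_bytes(size_t datum_len)
{
    return sizeof(unsigned int) + datum_len;
}
```

For a 32-byte md5 datum this gives roughly 80 bytes in memory versus 36 on tape, about a 2x inflation, which is consistent with 32 in-memory blocks draining only 16 to 18 tape blocks.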
Re: [HACKERS] Using quicksort for every external sort run
On Sat, Dec 12, 2015 at 12:10 AM, Jeff Janes wrote: > I have a question about the terminology used in this patch. What is a > tuple proper? What is it in contradistinction to? I would think that > a tuple which is located in its own palloc'ed space is the "proper" > one, leaving a tuple allocated in the bulk memory pool to be > called...something else. I don't know what the > non-judgmental-sounding antonym of postpositive "proper" is. "Tuple proper" is a term that appears 5 times in tuplesort.c today. As it says at the top of that file: /* * The objects we actually sort are SortTuple structs. These contain * a pointer to the tuple proper (might be a MinimalTuple or IndexTuple), * which is a separate palloc chunk --- we assume it is just one chunk and * can be freed by a simple pfree(). SortTuples also contain the tuple's * first key column in Datum/nullflag format, and an index integer. > Also, if I am reading this correctly, when we refill a pool from a > logical tape we still transform each tuple as it is read from the disk > format to the memory format. This inflates the size quite a bit, at > least for single-datum tuples. If we instead just read the disk > format directly into the pool, and converted them into the in-memory > format when each tuple came due for the merge heap, would that destroy > the locality of reference you are seeking to gain? Are you talking about alignment? -- Peter Geoghegan
Re: [HACKERS] Using quicksort for every external sort run
On Sat, Dec 12, 2015 at 7:42 PM, Ants Aasma wrote: > Since the open coding doesn't help with eliminating control flow > dependencies, my idea is to encode the sort network comparison > order in an array and use that to drive a simple loop. The code size > would be pretty similar to insertion sort and the loop overhead should > mostly be hidden by the CPU OoO machinery. Probably won't help much, > but would be interesting and simple enough to try out. Can you share > your code for the benchmark so I can try it out? I can. But the further results showing the number of comparisons is higher than for insertion sort have dampened my enthusiasm for the change. I'm assuming that even if it's faster for a simple integer sort it'll be much slower for anything that requires calling out to the datatype comparator. I also hadn't actually measured what percentage of the sort was being spent in the insertion sort. I had guessed it would be higher. The test is attached. qsort_tuple.c is copied from tuplesort (with the ifdef for NOPRESORT added, but you could skip that if you want). Compile with something like: gcc -DNOPRESORT -O3 -DCOUNTS -Wall -Wno-unused-function simd-sort-test.c -- greg

#include <stdio.h>
#include <stdlib.h>

#ifdef COUNTS
unsigned nswaps;
unsigned ncmps;
#define countswap (nswaps++)
#define countcmp  (ncmps++)
#else
#define countswap ((void) 0)
#define countcmp  ((void) 0)
#endif

typedef void *Tuplesortstate;
typedef long long int Datum;
typedef int bool;

typedef struct
{
    void   *tuple;              /* the tuple proper */
    Datum   datum1;             /* value of first key column */
    bool    isnull1;            /* is first key column NULL? */
    int     tupindex;           /* see notes above */
} SortTuple;

typedef SortTuple elem_t;

typedef int (*SortTupleComparator) (const SortTuple *a, const SortTuple *b,
                                    Tuplesortstate *state);

typedef struct SortSupportData *SortSupport;

static int
comparator(Datum a, Datum b, SortSupport ssup)
{
    if (b > a)
        return -1;
    else if (b < a)
        return 1;
    else
        return 0;
}

struct SortSupportData
{
    bool    ssup_nulls_first;
    bool    ssup_reverse;
    int     (*comparator) (Datum, Datum, SortSupport);
};

struct SortSupportData ssup = {0, 0, comparator};

static inline int
ApplySortComparator(Datum datum1, bool isNull1,
                    Datum datum2, bool isNull2,
                    SortSupport ssup)
{
    int     compare;

    countcmp;
    if (isNull1)
    {
        if (isNull2)
            compare = 0;        /* NULL "=" NULL */
        else if (ssup->ssup_nulls_first)
            compare = -1;       /* NULL "<" NOT_NULL */
        else
            compare = 1;        /* NULL ">" NOT_NULL */
    }
    else if (isNull2)
    {
        if (ssup->ssup_nulls_first)
            compare = 1;        /* NOT_NULL ">" NULL */
        else
            compare = -1;       /* NOT_NULL "<" NULL */
    }
    else
    {
        compare = (*ssup->comparator) (datum1, datum2, ssup);
        if (ssup->ssup_reverse)
            compare = -compare;
    }

    return compare;
}

#define CHECK_FOR_INTERRUPTS() do {} while (0)
#define Min(a,b) ((a)<(b)?(a):(b))
#define Max(a,b) ((a)<(b)?(b):(a))

#include "qsort_tuple.c"

#define mycmp(a,b) \
    (ApplySortComparator(list[a].datum1, list[a].isnull1, \
                         list[b].datum1, list[b].isnull1, \
                         &ssup))

#define myswap(a,b) \
    do { \
        elem_t _tmp; \
        _tmp = list[a]; \
        list[a] = list[b]; \
        list[b] = _tmp; \
        countswap; \
    } while (0)

#define myswapif(a,b) \
    do { \
        if (mycmp(a,b) > 0) \
            myswap(a,b); \
    } while (0)

static int
insertion_ok(unsigned N)
{
    return N < 1000;
}

static void
insertion(elem_t *list, unsigned N)
{
    if (N > 1000)
    {
        printf("insertion sort not feasible for large N\n");
        exit(1);
    }
    for (unsigned pm = 1; pm < N; pm++)
        for (unsigned pl = pm; pl && mycmp(pl-1, pl) > 0; pl--)
            myswap(pl, pl-1);
}

static int
sort_networks_ok(unsigned N)
{
    return N <= 8;
}

static void
sort_networks(elem_t *list, unsigned N)
{
    if (N > 8)
    {
        printf("sort_networks only implemented for N in 0..8\n");
        exit(1);
    }
    switch (N)
    {
        case 0:
        case 1:
            break;
        case 2:
            myswapif(0,1);
            break;
        case 3:
            myswapif(0,1); myswapif(1,2); myswapif(0,1);
            break;
        case 4:
            myswapif(0,1); myswapif(2,3); myswapif(1,3); myswapif(0,2);
            myswapif(1,2);
            break;
        case 5:
            myswapif(1,2); myswapif(3,4); myswapif(1,3); myswapif(0,2);
            myswapif(2,4); myswapif(0,3); myswapif(0,1); myswapif(2,3);
            myswapif(1,2);
            break;
        case 6:
            myswapif(0,1); myswapif(2,3); myswapif(4,5); myswapif(0,2);
            myswapif(3,5); myswapif(1,4); myswapif(0,1); myswapif(2,3);
            myswapif(4,5); myswapif(1,2); myswapif(3,4); myswapif(2,3);
            break;
        case 7:
            myswapif(1,2); myswapif(3,4); myswapif(5,6); myswapif(0,2);
            myswapif(4,6); myswapif(3,5); myswapif(2,6); myswapif(1,5);
            myswapif(0,4); myswapif(2,5); myswapif(0,3); myswapif(2,4);
            myswapif(1,3); myswapif(0,1); myswapif(2,3); myswapif(4,5);
            break;
        case 8:
            myswapif(0, 1); myswapif(2, 3); myswapif(4, 5); myswapif(6, 7);
            myswapif(0, 3); myswapif(1, 2); myswapif(4, 7); myswapif(5, 6);
            myswapif(0, 1); myswapif(2,
Re: [HACKERS] Using quicksort for every external sort run
On Sat, Dec 12, 2015 at 12:41 AM, Greg Stark wrote: > On Wed, Dec 9, 2015 at 2:44 AM, Peter Geoghegan wrote: >> On Tue, Dec 8, 2015 at 6:39 PM, Greg Stark wrote: >> >> I guess you mean insertion sort. What's the theoretical justification >> for the change? > > Well my thinking was that hard coding a series of comparisons would be > faster than a loop doing an O(n^2) algorithm even for small constants. > And sort networks are perfect for hard coded sorts because they do the > same comparisons regardless of the results of previous comparisons so > there are no branches. And even better the comparisons are as much as > possible independent of each other -- sort networks are typically > measured by the depth which assumes any comparisons between disjoint > pairs can be done in parallel. Even if it's implemented in serial the > processor is probably parallelizing some of the work. > > So I implemented a quick benchmark outside Postgres based on sorting > actual SortTuples with datum1 defined to be random 64-bit integers (no > nulls). Indeed the sort networks perform faster on average despite > doing more comparisons. That makes me think the cpu is indeed doing > some of the work in parallel. The open coded version you shared bloats the code by 37kB; I'm not sure it is pulling its weight, especially given relatively heavy comparators. A quick index creation test on int4's profiled with perf shows about 3% of CPU being spent in the code being replaced. Any improvement on that is going to be too small to easily quantify. Since the open coding doesn't help with eliminating control flow dependencies, my idea is to encode the sort network comparison order in an array and use that to drive a simple loop. The code size would be pretty similar to insertion sort and the loop overhead should mostly be hidden by the CPU OoO machinery. Probably won't help much, but would be interesting and simple enough to try out. Can you share your code for the benchmark so I can try it out? 
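The table-driven approach Ants describes might look something like this sketch for the N=4 case (the pair table is the same standard 4-element sorting network as the open-coded version; none of this is from an actual patch):

```c
#include <stddef.h>

typedef struct { int a, b; } SwapPair;

/* Comparison schedule for a 4-element sorting network, encoded as
 * data -- the same pairs the open-coded version hard codes. */
static const SwapPair network4[] = {
    {0, 1}, {2, 3}, {1, 3}, {0, 2}, {1, 2}
};

/* One simple loop replaces the unrolled compare/swap sequence; which
 * comparisons are performed never depends on earlier outcomes, so
 * there are no data-dependent branches in the schedule itself. */
static void
network_sort4(long long *v)
{
    for (size_t i = 0; i < sizeof(network4) / sizeof(network4[0]); i++)
    {
        int         a = network4[i].a;
        int         b = network4[i].b;

        if (v[a] > v[b])
        {
            long long   tmp = v[a];

            v[a] = v[b];
            v[b] = tmp;
        }
    }
}
```

In the real benchmark the `v[a] > v[b]` test would be an ApplySortComparator() call on SortTuples, and there would be one pair table per supported N, so the code size stays close to that of insertion sort.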
Regards, Ants Aasma