Hi all, I have continued testing the parallel CREATE INDEX patch. So far I haven't come across any issue or regression with the patches. Here are a few performance numbers from the latest round of testing, which was performed on top of the 6th Jan patch submitted by Peter.
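For reference, each timed run was essentially of the following shape (a minimal sketch, not a literal session: the settings shown are the Test 1 values listed below, max_wal_size has to be set in postgresql.conf, the durations reported below are in psql \timing format, and the sketch assumes the index is dropped between repeat runs):

SET max_parallel_workers_maintenance = 2;   -- 4 in Test 2, 8 in Test 3
SET max_parallel_workers = 16;
SET max_parallel_workers_per_gather = 8;
SET maintenance_work_mem = '1GB';           -- 2GB in Test 2, 4GB in Test 3
\timing on
CREATE INDEX li_ordkey_idx1 ON lineitem(l_orderkey);
DROP INDEX li_ordkey_idx1;                  -- repeat; the median of 3 runs is reported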
Testing was done on an OpenStack instance with:

CPU: 8
RAM: 16GB
HD: 640 GB

postgres=# select pg_size_pretty(pg_total_relation_size ('lineitem'));
 pg_size_pretty
----------------
 93 GB
(1 row)

-- Test 1.
max_parallel_workers_maintenance = 2
max_parallel_workers = 16
max_parallel_workers_per_gather = 8
maintenance_work_mem = 1GB
max_wal_size = 4GB

-- Test 2.
max_parallel_workers_maintenance = 4
max_parallel_workers = 16
max_parallel_workers_per_gather = 8
maintenance_work_mem = 2GB
max_wal_size = 4GB

-- Test 3.
max_parallel_workers_maintenance = 8
max_parallel_workers = 16
max_parallel_workers_per_gather = 8
maintenance_work_mem = 4GB
max_wal_size = 4GB

NOTE: All the "time taken" entries are the median of 3 consecutive runs of the same B-tree index creation query.

Time taken for parallel index creation, simple and composite indexes. For each test: "without patch" vs. "with patch" at that test's max_parallel_workers_maintenance setting, and the % change.

1. Index on "bigint" column:
   CREATE INDEX li_ordkey_idx1 ON lineitem(l_orderkey);
   Test 1: without patch 1062446.462 ms (17:42.446),    with patch 1024972.273 ms (17:04.972),   3.52 %
   Test 2: without patch 1053468.945 ms (17:33.469),    with patch  896375.543 ms (14:56.376),  17.75 %
   Test 3: without patch 1082920.703 ms (18:02.921),    with patch  932550.058 ms (15:32.550),  13.88 %

2. Index on "integer" column:
   CREATE INDEX li_lineno_idx2 ON lineitem(l_linenumber);
   Test 1: without patch 1538285.499 ms (25:38.285),    with patch 1201008.423 ms (20:01.008),  21.92 %
   Test 2: without patch 1529837.023 ms (25:29.837),    with patch 1014188.489 ms (16:54.188),  33.70 %
   Test 3: without patch 1642160.947 ms (27:22.161),    with patch  978518.253 ms (16:18.518),  40.41 %

3. Index on "numeric" column:
   CREATE INDEX li_qty_idx3 ON lineitem(l_quantity);
   Test 1: without patch 3968102.568 ms (01:06:08.103), with patch 2359304.405 ms (39:19.304),  40.54 %
   Test 2: without patch 4129510.930 ms (01:08:49.511), with patch 1680201.644 ms (28:00.202),  59.31 %
   Test 3: without patch 4348248.210 ms (01:12:28.248), with patch 1490461.879 ms (24:50.462),  65.72 %

4. Index on "character" column:
   CREATE INDEX li_lnst_idx4 ON lineitem(l_linestatus);
   Test 1: without patch 1510273.931 ms (25:10.274),    with patch 1240265.301 ms (20:40.265),  17.87 %
   Test 2: without patch 1516842.985 ms (25:16.843),    with patch  995730.092 ms (16:35.730),  34.35 %
   Test 3: without patch 1580789.375 ms (26:20.789),    with patch  984975.746 ms (16:24.976),  37.69 %

5. Index on "date" column:
   CREATE INDEX li_shipdt_idx5 ON lineitem(l_shipdate);
   Test 1: without patch 1483603.274 ms (24:43.603),    with patch 1189704.930 ms (19:49.705),  19.80 %
   Test 2: without patch 1498348.925 ms (24:58.349),    with patch 1040421.626 ms (17:20.422),  30.56 %
   Test 3: without patch 1653651.499 ms (27:33.651),    with patch 1016305.794 ms (16:56.306),  38.54 %

6. Index on "character varying" column:
   CREATE INDEX li_comment_idx6 ON lineitem(l_comment);
   Test 1: without patch 6945953.838 ms (01:55:45.954), with patch 4329696.334 ms (01:12:09.696), 37.66 %
   Test 2: without patch 6818556.437 ms (01:53:38.556), with patch 2834034.054 ms (47:14.034),    58.43 %
   Test 3: without patch 6942285.711 ms (01:55:42.286), with patch 2648430.902 ms (44:08.431),    61.85 %

7. Composite index on "numeric", "character" columns:
   CREATE INDEX li_qtylnst_idx34 ON lineitem (l_quantity, l_linestatus);
   Test 1: without patch 4961563.400 ms (01:22:41.563), with patch 2959722.178 ms (49:19.722),  40.34 %
   Test 2: without patch 5242809.501 ms (01:27:22.810), with patch 2077463.136 ms (34:37.463),  60.37 %
   Test 3: without patch 5576765.727 ms (01:32:56.766), with patch 1755829.420 ms (29:15.829),  68.51 %

8. Composite index on "date", "character varying" columns:
   CREATE INDEX li_shipdtcomment_idx56 ON lineitem (l_shipdate, l_comment);
   Test 1: without patch 4693318.077 ms (01:18:13.318), with patch 3181494.454 ms (53:01.494),  32.21 %
   Test 2: without patch 4627624.682 ms (01:17:07.625), with patch 2613289.211 ms (43:33.289),  43.52 %
   Test 3: without patch 4719242.965 ms (01:18:39.243), with patch 2685516.832 ms (44:45.517),  43.09 %

Thanks & Regards,
Prabhat Kumar Sahu
Skype ID: prabhat.sahu1984
www.enterprisedb.com

On Tue, Jan 16, 2018 at 6:24 AM, Peter Geoghegan <p...@bowt.ie> wrote:
> On Fri, Jan 12, 2018 at 10:28 AM, Robert Haas <robertmh...@gmail.com> wrote:
> > More comments:
>
> Attached patch has all open issues worked through, including those that I respond to or comment on below, as well as the other feedback from your previous e-mails. Note also that I fixed the issue that Amit raised, as well as the GetOldestXmin()-argument bug that I noticed in passing when responding to Amit. I also worked on the attribution in the commit message.
>
> Before getting to my responses to your most recent round of feedback, I want to first talk about some refactoring that I decided to do. As you can see from the master branch, tuplesort_performsort() isn't necessarily reached for spool2, even when we start out with a spool2 (that is, for many unique index builds, spool2 never even does a tuplesort_performsort()). We may instead decide to shut down spool2 when it has no (dead) tuples. I made this work just as well for the parallel case in this latest revision. I had to teach tuplesort.c to accept an early tuplesort_end() for LEADER() -- it had to be prepared to release still-waiting workers in some cases, rather than depending on nbtsort.c having called tuplesort_performsort() already. Several routines within nbtsort.c that previously knew something about parallelism now know nothing about it. This seems like a nice win.
>
> Separately, I took advantage of the fact that within the leader, its *worker* Tuplesortstate can safely call tuplesort_end() before the leader state's tuplesort_performsort() call.
>
> The overall effect of these two changes is that there is now a _bt_leader_heapscan() call for the parallel case that nicely mirrors the serial case's IndexBuildHeapScan() call, and once we're done with populating spools, no subsequent code needs to know a single thing about parallelism as a special case. You may notice some small changes to the tuplesort.h overview, which now advertises that callers can take advantage of this leeway.
>
> Now on to my responses to your most recent round of feedback...
>
> > BufFileView() looks fairly pointless. It basically creates a copy of the input and, in so doing, destroys the input, which is a lot like returning the input parameter except that it uses more cycles. It does do a few things.
>
> While it certainly did occur to me that that was kind of weird, and I struggled with it on my own for a little while, I ultimately agreed with Thomas that it added something to have ltsConcatWorkerTapes() call some buffile function in every iteration of its loop. (BufFileView() + BufFileViewAppend() are code that Thomas actually wrote, though I added the asserts and comments myself.)
>
> If you think about this in terms of the interface rather than the implementation, then it may make more sense. The encapsulation adds something which might pay off later, such as when extendBufFile() needs to work with a concatenated set of BufFiles. And even right now, I cannot simply reuse the BufFile without then losing the assert that is currently in BufFileViewAppend() (must not have associated shared fileset assert). So I'd end up asserting less (rather than more) there if BufFileView() was removed.
>
> It wastes some cycles to not simply use the BufFile directly, but not terribly many in the grand scheme of things. This happens once per external sort operation.
>
> > In miscadmin.h, I'd put the prototype for the new GUC next to max_worker_processes, not maintenance_work_mem.
>
> But then I'd really have to put it next to max_worker_processes in globals.c, too. That would mean that it would go under "Primary determinants of sizes of shared-memory structures" within globals.c, which seems wrong to me. What do you think?
>
> > The ereport() in index_build will, I think, confuse people when it says that there are 0 parallel workers. I suggest splitting this into two cases: if (indexInfo->ii_ParallelWorkers == 0) ereport(... "building index \"%s\" on table \"%s\" serially" ...) else ereport(... "building index \"%s\" on table \"%s\" in parallel with request for %d parallel workers" ...).
>
> WFM. I've simply dropped any reference to leader participation in the messages here, to keep things simple. This seemed okay because the only thing that affects leader participation is the parallel_leader_participation GUC, which is under the user's direct control at all times, and is unlikely to be changed. Those that really want further detail have trace_sort for that.
>
> > The logic in IndexBuildHeapRangeScan() around need_register_snapshot and OldestXmin seems convoluted and not very well-edited to me.
>
> Having revisited it, I now agree that the code added to IndexBuildHeapRangeScan() was unclear, primarily in that the need_unregister_snapshot local variable was overloaded in a weird way.
>
> > I suggest that you go back to the original code organization and then just insert an additional case for a caller-supplied scan, so that the overall flow looks like this:
> >
> > if (scan != NULL)
> > {
> >     ...
> > }
> > else if (IsBootstrapProcessingMode() || indexInfo->ii_Concurrent)
> > {
> >     ...
> > }
> > else
> > {
> >     ...
> > }
>
> The problem that I see with this alternative flow is that the "if (scan != NULL)" and the "else if (IsBootstrapProcessingMode() || indexInfo->ii_Concurrent)" blocks clearly must contain code for two distinct, non-overlapping cases, despite the fact those two cases actually do overlap somewhat. That is, a call to IndexBuildHeapRangeScan() may have a (parallel) heap scan argument (control reaches your first code block), or may not (control reaches your second or third code block). At the same time, a call to IndexBuildHeapRangeScan() may use SnapshotAny (ordinary CREATE INDEX), or may need an MVCC snapshot (either by registering its own, or using the parallel one). These two things are orthogonal.
>
> I think I still get the gist of what you're saying, though. I've come up with a new structure that is a noticeable improvement on what I had. Importantly, the new structure let me add a number of parallelism-agnostic asserts that make sure that every ambuild routine that supports parallelism gets the details right.
>
> > Along with that, I'd change the name of need_register_snapshot to need_unregister_snapshot (it's doing both jobs right now) and initialize it to false.
>
> Done.
>
> > + * This support code isn't reliable when called from within a parallel
> > + * worker process due to the fact that our state isn't propagated. This is
> > + * why parallel index builds are disallowed on catalogs.
> > + * It is possible that we'll fail to catch an attempted use of a user index
> > + * undergoing reindexing due the non-propagation of this state to workers,
> > + * which is not ideal, but the problem is not particularly likely to go
> > + * undetected due to our not doing better there.
> >
> > I understand the first two sentences, but I have no idea what the third one means, especially the part that says "not particularly likely to go undetected due to our not doing better there". It sounds scary that something bad is only "not particularly likely to go undetected"; don't we need to detect bad things reliably?
>
> The primary point here, that you said you understood, is that we definitely need to detect when we're reindexing a catalog index within the backend, so that systable_beginscan() can do the right thing and not use the index (we also must avoid assertion failures). My solution to that problem is, of course, to not allow the use of parallel create index when REINDEXing a system catalog. That seems 100% fine to me.
>
> There is a little bit of ambiguity about other cases, though -- that's the secondary point I tried to make within that comment block, and the part that you took issue with. To put this secondary point another way: It's possible that we'd fail to detect it if someone's comparator went bananas and decided it was okay to do SQL access (that resulted in an index scan of the index undergoing reindex). That does seem rather unlikely, but I felt it necessary to say something like this because ReindexIsProcessingIndex() isn't already something that only deals with catalog indexes -- it works with all indexes.
>
> Anyway, I reworded this. I hope that what I came up with is clearer than before.
>
> > But also, you used the word "not" three times and also the prefix "un-", meaning "not", once. Four negations in 13 words! Perhaps I'm not entirely in a position to cast aspersions on overly-complex phraseology -- the pot calling the kettle black and all that -- but I bet that will be a lot clearer if you reduce the number of negations to either 0 or 1.
>
> You're not wrong. Simplified.
>
> > The comment change in standard_planner() doesn't look helpful to me; I'd leave it out.
>
> Okay.
>
> > + * tableOid is the table that index is to be built on. indexOid is the OID
> > + * of a index to be created or reindexed (which must be a btree index).
> >
> > I'd rewrite that first sentence to end "the table on which the index is to be built". The second sentence should say "an index" rather than "a index".
>
> Okay.
>
> > But, actually, I think we would be better off just ripping leaderWorker/leaderParticipates out of this function altogether. compute_parallel_worker() is not really under any illusion that it's computing a number of *participants*; it's just computing a number of *workers*.
>
> That distinction does seem to cause plenty of confusion. While I accept what you say about compute_parallel_worker(), I still haven't gone as far as removing the leaderParticipates argument altogether, because compute_parallel_worker() isn't the only thing that matters here. (More on that below.)
>
> > I think it's fine for parallel_leader_participation=off to simply mean that you get one fewer participants. That's actually what would happen with parallel query, too.
> > Parallel query would consider parallel_leader_participation later, in get_parallel_divisor(), when working out the cost of one path vs. another, but it doesn't use it to choose the number of workers. So it seems to me that getting rid of all of the workerLeader considerations will make it both simpler and more consistent with what we do for queries.
>
> I was aware of those details, and figured that parallel query fudges the compute_parallel_worker() figure's leader participation in some sense, and that that was what I needed to compensate for. After all, when parallel_leader_participation=off, having compute_parallel_worker() return 1 means rather a different thing to what it means with parallel_leader_participation=on, even though in general we seem to assume that parallel_leader_participation can only make a small difference overall.
>
> Here's what I've done based on your feedback: I've changed the header comments, but stopped leaderParticipates from affecting the compute_parallel_worker() calculation (so, as I said, leaderParticipates stays). The leaderParticipates argument continues to affect these two aspects of plan_create_index_workers()'s return value:
>
> 1. It continues to be used so we have a total number of participants (not workers) to apply our must-have-32MB-workMem limit on participants.
>
> Parallel query has no equivalent of this, and it seems warranted. Note that this limit is no longer applied when the parallel_workers storage param was set, as discussed.
>
> 2. I continue to use the leaderParticipates argument to disallow the case where there is only one CREATE INDEX participant but parallelism is in use, because, of course, that clearly makes no sense -- we should just use a serial sort instead.
>
> (It might make sense to allow this if parallel_leader_participation was *purely* a testing GUC, only for use by backend hackers, but AFAICT it isn't.)
>
> The planner can allow a single participant parallel sequential scan path to be created without worrying about the fact that that doesn't make much sense, because a plan with only one parallel participant is always going to cost more than some serial plan (you will only get a 1 participant parallel sequential scan when force_parallel_mode is on). Obviously plan_create_index_workers() doesn't generate (partial) paths at all, so I simply have to get the same outcome (avoiding a senseless 1 participant parallel operation) some other way here.
>
> > If you have an idea how to make a better cost model than this for CREATE INDEX, I'm willing to consider other options. If you don't, or want to propose that as a follow-up patch, then I think it's OK to use what you've got here for starters. I just don't want it to be more baroque than necessary.
>
> I suspect that the parameters of any cost model for parallel CREATE INDEX that we're prepared to consider for v11 are: "Use a number of parallel workers that is one below the number at which the total duration of the CREATE INDEX either stays the same or goes up".
>
> It's hard to do much better than this within those parameters. I can see a fairly noticeable benefit to parallelism with 4 parallel workers and a measly 1MB of maintenance_work_mem (when parallelism is forced) relative to the serial case with the same amount of memory.
> At least on my laptop, it seems to be rather hard to lose relative to a serial sort when using parallel CREATE INDEX (to be fair, I'm probably actually using way more memory than 1MB to do this due to FS cache usage). I can think of a cleverer approach to costing parallel CREATE INDEX, but it's only cleverer by weighing distributed costs. Not very relevant, for the time being.
>
> BTW, the 32MB per participant limit within plan_create_index_workers() was chosen based on the realization that any higher value would make having a default setting of 2 for max_parallel_maintenance_workers (to match the max_parallel_workers_per_gather default) pointless when the default maintenance_work_mem value of 64MB is in use. That's not terribly scientific, though it at least doesn't come at the expense of a more scientific idea for a limit like that (I don't actually have one, you see). I am merely trying to avoid being *gratuitously* wasteful of shared resources that are difficult to accurately cost in (e.g., the distributed cost of random I/O to the system as a whole when we do a parallel index build while ridiculously low on maintenance_work_mem).
>
> > I think that the naming of the wait events could be improved. Right now, they are named by which kind of process does the waiting, but they really should be named based on the thing for which we're waiting. I also suggest that we could just write Sort instead of Tuplesort. In short, I suggest ParallelTuplesortLeader -> ParallelSortWorkersDone and ParallelTuplesortLeader -> ParallelSortTapeHandover.
>
> WFM. Also added documentation for the wait events to monitoring.sgml, which I somehow missed the first time around.
>
> > Not for this patch, but I wonder if it might be a worthwhile future optimization to allow workers to return multiple tapes to the leader. One doesn't want to go crazy with this, of course. If the worker returns 100 tapes, then the leader might get stuck doing multiple merge passes, which would be a foolish way to divide up the labor, and even if that doesn't happen, Amdahl's law argues for minimizing the amount of work that is not done in parallel. Still, what if a worker (perhaps after merging) ends up with 2 or 3 tapes? Is it really worth merging them so that the leader can do a 5-way merge instead of a 15-way merge?
>
> I did think about this myself, or rather I thought specifically about building a serial/bigserial PK during pg_restore, a case that must be very common. The worker merges for such an index build will typically be *completely pointless* when all input runs are in sorted order, because the merge heap will only need to consult the root of the heap and its two immediate children throughout (commit 24598337c helped cases like this enormously). You might as well merge hundreds of runs in the leader, provided you still have enough memory per tape that you can get the full benefit of OS readahead (this is not that hard when you're only going to repeatedly read from the same tape anyway).
>
> I'm not too worried about it, though. The overall picture is still very positive even in this case. The "extra worker merging" isn't generally a big proportion of the overall cost, especially there. More importantly, if I tried to do better, it would be the "quicksort with spillover" cost model story all over again (remember how tedious that was?).
> How hard are we prepared to work to ensure that we get it right when it comes to skipping worker merging, given that users always pay some overhead, even when that doesn't happen?
>
> Note also that parallel index builds manage to unfairly *gain* advantage over serial cases (they have the good variety of dumb luck, rather than the bad variety) in certain other common cases. This happens with an *inverse* physical/logical correlation (e.g. a DESC index build on a date field). They manage to artificially do better than theory would predict, simply because a greater number of smaller quicksorts are much faster during initial run generation, without also taking a concomitant performance hit at merge time. Thomas showed this at one point. Note that even that's only true because of the qsort precheck (what I like to call the "banana skin prone" precheck, that we added to our qsort implementation in 2006) -- it would be true for *all* correlations, but that one precheck thing complicates matters.
>
> All of this is a tricky business, and that isn't going to get any easier IMV.
>
> > + * Make sure that the temp file(s) underlying the tape set are created in
> > + * suitable temp tablespaces. This is only really needed for serial
> > + * sorts.
> >
> > This comment makes me wonder whether it is "sorta" needed for parallel sorts.
>
> I removed "really". The point of the comment is that we've already set up temp tablespaces for the shared fileset in the parallel case. Shared filesets figure out which tablespaces will be used up-front -- see SharedFileSetInit().
>
> > - if (trace_sort)
> > + if (trace_sort && !WORKER(state))
> >
> > I have a feeling we still want to get this output even from workers, but maybe I'm missing something.
>
> I updated tuplesort_end() so that trace_sort reports on the end of the sort, even for worker processes. (We still don't show the generic tuplesort_begin* message for workers, though.)
>
> > + arg5 indicates serial, parallel worker, or parallel leader sort.</entry>
> >
> > I think it should say what values are used for each case.
>
> I based this on "arg0 indicates heap, index or datum sort", where it's implied that the values are respective to the order that they appear in the sentence (starting from 0). But okay, I'll do it that way all the same.
>
> > + /* Release worker tuplesorts within leader process as soon as possible */
> >
> > IIUC, the worker tuplesorts aren't really holding onto much of anything in terms of resources. I think it might be better to phrase this as /* The sort we just did absorbed the final tapes produced by these tuplesorts, which are of no further use. */ or words to that effect.
>
> Okay. Done that way.
>
> > Instead of making a special case in CreateParallelContext for serializable_okay, maybe index_build should just use SetConfigOption() to force the isolation level to READ COMMITTED right after it does NewGUCNestLevel(). The change would only be temporary because the subsequent call to AtEOXact_GUC() will revert it.
>
> I tried doing it that way, but it doesn't seem workable:
>
> postgres=# begin transaction isolation level serializable ;
> BEGIN
> postgres=*# reindex index test_unique;
> ERROR:  25001: SET TRANSACTION ISOLATION LEVEL must be called before any query
> LOCATION:  call_string_check_hook, guc.c:9953
>
> Note that AutoVacLauncherMain() uses SetConfigOption() to set/modify default_transaction_isolation -- not transaction_isolation.
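[For context on that last point: default_transaction_isolation takes effect only for transactions that have not started yet, which is why the autovacuum launcher can set it with SetConfigOption() at any time, whereas transaction_isolation cannot be changed once the current transaction has run a query. A quick illustration (a hypothetical psql session, not taken from the patch):

postgres=# SET default_transaction_isolation = 'read committed';
SET
postgres=# BEGIN;
BEGIN
postgres=*# SELECT 1;
 ?column?
----------
        1
(1 row)

postgres=*# SET TRANSACTION ISOLATION LEVEL READ COMMITTED;
ERROR:  SET TRANSACTION ISOLATION LEVEL must be called before any query
]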
> Instead, I added a bit more to comments within CreateParallelContext(), to justify what I've done along the lines you went into. Hopefully this works better for you.
>
> > There is *still* more to review here, but my concentration is fading. If you could post an updated patch after adjusting for the comments above, I think that would be helpful. I'm not totally out of things to review that I haven't already looked over once, but I think I'm close.
>
> I'm impressed with how quickly you're getting through review of the patch. Hopefully we can keep that momentum up.
>
> Thanks
> --
> Peter Geoghegan