Re: [HACKERS] WIP: [[Parallel] Shared] Hash
On Mon, May 22, 2017 at 6:39 AM, Thomas Munro wrote:
> On Thu, Apr 27, 2017 at 11:03 AM, Thomas Munro wrote:
>> On Thu, Apr 27, 2017 at 5:13 AM, Oleg Golovanov wrote:
>>> Can you update your patch set? The error came from
>>> 0010-hj-parallel-v12.patch.
>>
>> I really should get around to setting up a cron job to tell me about
>> that. Here's a rebased version.
>
> Rebased.

Rebased for the recent re-indent and shm_toc API change; no functional
changes in this version. (I have a new patch set in the pipeline adding
the skew optimisation and some other things; more on that soon.)

--
Thomas Munro
http://www.enterprisedb.com

Attachment: parallel-shared-hash-v15.patchset.tgz (GNU Zip compressed data)

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] WIP: [[Parallel] Shared] Hash
On Thu, Apr 27, 2017 at 11:03 AM, Thomas Munro wrote:
> On Thu, Apr 27, 2017 at 5:13 AM, Oleg Golovanov wrote:
>> Can you update your patch set? The error came from
>> 0010-hj-parallel-v12.patch.
>
> I really should get around to setting up a cron job to tell me about
> that. Here's a rebased version.

Rebased.

--
Thomas Munro
http://www.enterprisedb.com

Attachment: parallel-shared-hash-v14.tgz (GNU Zip compressed data)
Re: [HACKERS] WIP: [[Parallel] Shared] Hash
On Thu, Apr 27, 2017 at 5:13 AM, Oleg Golovanov wrote:
> Can you update your patch set? The error came from
> 0010-hj-parallel-v12.patch.

I really should get around to setting up a cron job to tell me about
that. Here's a rebased version.

The things currently on my list for this patch are:

1. Implement the skew optimisation.

2. Consider Andres's suggestion of splitting MultiExecHash into two
functions, serial and parallel versions, rather than having all those
conditional blocks in there.

3. Figure out whether the shared BufFile stuff I propose would work
well for Peter Geoghegan's parallel tuple sort patch, by trying it
(I've made a start, more soon).

4. Figure out how the costing model needs to be tweaked, probably
based on experimentation.

I'm taking a short break to work on other things right now but will
post a version with those changes soon.

--
Thomas Munro
http://www.enterprisedb.com

Attachment: parallel-shared-hash-v13.tgz (GNU Zip compressed data)
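[For readers following along: the refactoring suggested in point 2 above - splitting one function full of conditional blocks into a serial and a parallel variant with a single dispatch - can be sketched as below. The HashNode struct and every name here are illustrative stand-ins, not the actual executor code, which operates on HashState and HashJoinTable.]

```c
#include <stdbool.h>

/*
 * Hypothetical sketch of splitting MultiExecHash into serial and
 * parallel variants, dispatched once, instead of branching on
 * "is this shared?" throughout one function.  All names here are
 * illustrative assumptions.
 */
typedef struct HashNode
{
    bool        parallel;       /* shared (parallel) hash table build? */
    int         build_path;     /* which variant ran (for illustration) */
} HashNode;

static void
multi_exec_hash_serial(HashNode *node)
{
    /* build the hash table single-threaded, no synchronization */
    node->build_path = 1;
}

static void
multi_exec_hash_parallel(HashNode *node)
{
    /* coordinate with other workers, e.g. via barriers */
    node->build_path = 2;
}

void
multi_exec_hash(HashNode *node)
{
    if (node->parallel)
        multi_exec_hash_parallel(node);
    else
        multi_exec_hash_serial(node);
}
```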
[HACKERS] Re: [HACKERS] WIP: [[Parallel] Shared] Hash
Hi. Thanks for the rebased patch set v12. Currently I am trying to use
this patch on my new test site and I get the following:

Hmm... The next patch looks like a unified diff to me...
The text leading up to this was:
--
|diff --git a/src/include/access/parallel.h b/src/include/access/parallel.h
|index bdf15621c83..e9db8880161 100644
|--- a/src/include/access/parallel.h
|+++ b/src/include/access/parallel.h
--
patching file src/include/access/parallel.h
Using Plan A...
Hunk #1 FAILED at 58.
1 out of 1 hunk FAILED -- saving rejects to file src/include/access/parallel.h.rej

Can you update your patch set? The error came from
0010-hj-parallel-v12.patch.

Best Regards,
Oleg Golovanov
Moscow, Russia

> Thursday, 13 April 2017, 13:49 +03:00 from Thomas Munro:
>
> On Thu, Apr 13, 2017 at 10:04 PM, Oleg Golovanov <rent...@mail.ru> wrote:
>> bash-4.2$ grep Hunk *.log | grep FAILED
>> 0005-hj-leader-gate-v11.patch.log:Hunk #1 FAILED at 14.
>> 0010-hj-parallel-v11.patch.log:Hunk #2 FAILED at 2850.
>> 0010-hj-parallel-v11.patch.log:Hunk #1 FAILED at 21.
>> 0010-hj-parallel-v11.patch.log:Hunk #3 FAILED at 622.
>> 0010-hj-parallel-v11.patch.log:Hunk #6 FAILED at 687.
>> 0010-hj-parallel-v11.patch.log:Hunk #1 FAILED at 21.
>> 0010-hj-parallel-v11.patch.log:Hunk #3 FAILED at 153.
>
> Hi Oleg
>
> Thanks for looking at this. It conflicted with commit 9c7f5229. Here
> is a rebased patch set.
>
> This version also removes some code for dealing with transient record
> types which didn't work out. I'm trying to deal with that problem
> separately[1] and in a general way so that the parallel hash join
> patch doesn't have to deal with it at all.
>
> [1] https://www.postgresql.org/message-id/CAEepm=0ZtQ-SpsgCyzzYpsXS6e=kzwqk3g5ygn3mdv7a8da...@mail.gmail.com
>
> --
> Thomas Munro
> http://www.enterprisedb.com
Re: [HACKERS] WIP: [[Parallel] Shared] Hash
On Thu, Apr 13, 2017 at 10:04 PM, Oleg Golovanov wrote:
> bash-4.2$ grep Hunk *.log | grep FAILED
> 0005-hj-leader-gate-v11.patch.log:Hunk #1 FAILED at 14.
> 0010-hj-parallel-v11.patch.log:Hunk #2 FAILED at 2850.
> 0010-hj-parallel-v11.patch.log:Hunk #1 FAILED at 21.
> 0010-hj-parallel-v11.patch.log:Hunk #3 FAILED at 622.
> 0010-hj-parallel-v11.patch.log:Hunk #6 FAILED at 687.
> 0010-hj-parallel-v11.patch.log:Hunk #1 FAILED at 21.
> 0010-hj-parallel-v11.patch.log:Hunk #3 FAILED at 153.

Hi Oleg

Thanks for looking at this. It conflicted with commit 9c7f5229. Here
is a rebased patch set.

This version also removes some code for dealing with transient record
types which didn't work out. I'm trying to deal with that problem
separately[1] and in a general way so that the parallel hash join
patch doesn't have to deal with it at all.

[1] https://www.postgresql.org/message-id/CAEepm=0ZtQ-SpsgCyzzYpsXS6e=kzwqk3g5ygn3mdv7a8da...@mail.gmail.com

--
Thomas Munro
http://www.enterprisedb.com

Attachment: parallel-shared-hash-v12.tgz (GNU Zip compressed data)
[HACKERS] Re: [HACKERS] WIP: [[Parallel] Shared] Hash
Hi. I got errors while patching on CentOS 7:

bash-4.2$ grep Hunk *.log | grep FAILED
0005-hj-leader-gate-v11.patch.log:Hunk #1 FAILED at 14.
0010-hj-parallel-v11.patch.log:Hunk #2 FAILED at 2850.
0010-hj-parallel-v11.patch.log:Hunk #1 FAILED at 21.
0010-hj-parallel-v11.patch.log:Hunk #3 FAILED at 622.
0010-hj-parallel-v11.patch.log:Hunk #6 FAILED at 687.
0010-hj-parallel-v11.patch.log:Hunk #1 FAILED at 21.
0010-hj-parallel-v11.patch.log:Hunk #3 FAILED at 153.

What is wrong? The sources were clean:

bash-4.2$ git status
# On branch master
nothing to commit, working directory clean

I was patching with the commands:

patch -b -i ../.patches/parallel-shared-hash-v11/0001-hj-refactor-memory-accounting-v11.patch -p1 --verbose > ../.patches/parallel-shared-hash-v11/0001-hj-refactor-memory-accounting-v11.patch.log
patch -b -i ../.patches/parallel-shared-hash-v11/0002-hj-refactor-batch-increases-v11.patch -p1 --verbose > ../.patches/parallel-shared-hash-v11/0002-hj-refactor-batch-increases-v11.patch.log
patch -b -i ../.patches/parallel-shared-hash-v11/0003-hj-refactor-unmatched-v11.patch -p1 --verbose > ../.patches/parallel-shared-hash-v11/0003-hj-refactor-unmatched-v11.patch.log
patch -b -i ../.patches/parallel-shared-hash-v11/0004-hj-barrier-v11.patch -p1 --verbose > ../.patches/parallel-shared-hash-v11/0004-hj-barrier-v11.patch.log
patch -b -i ../.patches/parallel-shared-hash-v11/0005-hj-leader-gate-v11.patch -p1 --verbose > ../.patches/parallel-shared-hash-v11/0005-hj-leader-gate-v11.patch.log
patch -b -i ../.patches/parallel-shared-hash-v11/0006-hj-let-node-have-seg-in-worker-v11.patch -p1 --verbose > ../.patches/parallel-shared-hash-v11/0006-hj-let-node-have-seg-in-worker-v11.patch.log
patch -b -i ../.patches/parallel-shared-hash-v11/0007-hj-remove-buf-file-is-temp-v11.patch -p1 --verbose > ../.patches/parallel-shared-hash-v11/0007-hj-remove-buf-file-is-temp-v11.patch.log
patch -b -i ../.patches/parallel-shared-hash-v11/0008-hj-buf-file-set-v11.patch -p1 --verbose > ../.patches/parallel-shared-hash-v11/0008-hj-buf-file-set-v11.patch.log
patch -b -i ../.patches/parallel-shared-hash-v11/0009-hj-shared-tuplestore-v11.patch -p1 --verbose > ../.patches/parallel-shared-hash-v11/0009-hj-shared-tuplestore-v11.patch.log
patch -b -i ../.patches/parallel-shared-hash-v11/0010-hj-parallel-v11.patch -p1 --verbose > ../.patches/parallel-shared-hash-v11/0010-hj-parallel-v11.patch.log

Best Regards,
Oleg Golovanov
Moscow, Russia

> Tuesday, 4 April 2017, 0:28 +03:00 from Thomas Munro:
>
> On Tue, Apr 4, 2017 at 9:11 AM, Andres Freund <and...@anarazel.de> wrote:
>> Hi,
>>
>> On 2017-03-31 17:53:12 +1300, Thomas Munro wrote:
>>> Thanks very much to Rafia for testing, and to Andres for his copious
>>> review feedback. Here's a new version. Changes:
>>
>> I unfortunately think it's too late to get this into v10. There's still
>> heavy development going on, several pieces changed quite noticeably
>> since the start of the CF and there are still features missing. Hence I
>> think this unfortunately has to be pushed - as much as I'd have liked to
>> have this in 10.
>>
>> Do you agree?
>
> Agreed.
>
> Thank you very much Andres, Ashutosh, Peter, Rafia and Robert for all
> the review, testing and discussion so far.
>
> --
> Thomas Munro
> http://www.enterprisedb.com
Re: [HACKERS] WIP: [[Parallel] Shared] Hash
On Tue, Apr 4, 2017 at 9:11 AM, Andres Freund wrote:
> Hi,
>
> On 2017-03-31 17:53:12 +1300, Thomas Munro wrote:
>> Thanks very much to Rafia for testing, and to Andres for his copious
>> review feedback. Here's a new version. Changes:
>
> I unfortunately think it's too late to get this into v10. There's still
> heavy development going on, several pieces changed quite noticeably
> since the start of the CF and there are still features missing. Hence I
> think this unfortunately has to be pushed - as much as I'd have liked to
> have this in 10.
>
> Do you agree?

Agreed.

Thank you very much Andres, Ashutosh, Peter, Rafia and Robert for all
the review, testing and discussion so far.

--
Thomas Munro
http://www.enterprisedb.com
Re: [HACKERS] WIP: [[Parallel] Shared] Hash
Hi,

On 2017-03-31 17:53:12 +1300, Thomas Munro wrote:
> Thanks very much to Rafia for testing, and to Andres for his copious
> review feedback. Here's a new version. Changes:

I unfortunately think it's too late to get this into v10. There's still
heavy development going on, several pieces have changed quite noticeably
since the start of the CF, and there are still features missing. Hence I
think this unfortunately has to be pushed - as much as I'd have liked to
have this in 10.

Do you agree?

Regards,

Andres
Re: [HACKERS] WIP: [[Parallel] Shared] Hash
Hi Thomas,

On 2017-03-31 17:53:12 +1300, Thomas Munro wrote:
> Thanks very much to Rafia for testing, and to Andres for his copious
> review feedback. Here's a new version. Changes:

I've not looked at that aspect, but one thing I think would be good is
to first add a patch that increases coverage of nodeHash[join].c to
nearly 100%. There are currently significant bits of nodeHash.c that
aren't covered (skew optimization, large tuples).

https://coverage.postgresql.org/src/backend/executor/nodeHash.c.gcov.html
https://coverage.postgresql.org/src/backend/executor/nodeHashjoin.c.gcov.html

- Andres
Re: [HACKERS] WIP: [[Parallel] Shared] Hash
Hi hackers,

Thanks very much to Rafia for testing, and to Andres for his copious
review feedback. Here's a new version. Changes:

1. Keep all the backing files that are part of a BufFileSet in
subdirectories, as suggested by Andres. Now, instead of that unpopular
logic for scanning ranges of possible file paths to delete, we can just
blow away whole directories that group sets of related files.

2. Don't expose the 'participant' and 'partition' concepts, which
Andres didn't like much, in the BufFile API. There is a new concept
'stripe' which client code of BufFileSet can use to specify the
participant number in a more general way without saying so: it's really
just a way to spread files over tablespaces. I'm not sure if
tablespaces are really used that way much, but it seemed like Peter
wasn't going to be too happy with a proposal that didn't do *something*
to respect the existing temp_tablespaces GUC behaviour (and he'd be
right). But I didn't think it would make any kind of sense at all to
stripe by 1GB segments as private BufFiles do when writing from
multiple processes, as I have argued elsewhere, hence this scheme. The
'qunique' function used here (basically a poor man's std::unique) is
one I proposed earlier, with the name suggested by Tom Lane: see
https://www.postgresql.org/message-id/flat/CAEepm%3D2vmFTNpAmwbGGD2WaryM6T3hSDVKQPfUwjdD_5XY6vAA%40mail.gmail.com .

3. Merged the single-batch and multi-batch patches into one. Earlier I
had the idea that it was easier to review them in layers, since I hoped
people might catch a glimpse of the central simplicity without being
hit by a wall of multi-batch logic, but since Andres is reviewing and
disagrees, I give you 0010-hj-parallel-v11.patch, which weighs in at 32
files changed, 2278 insertions(+), 250 deletions(-).

4. Moved the DSM handling to the very end of resowner.c's cleanup.
Peter pointed out that it would otherwise happen before fd.c Files are
closed.
He was concerned about a different aspect of that which I'm not sure I
fully understand, but at the very least it seemed to represent a
significant problem for this design on Windows. I discussed this
briefly with Robert off-list and he told me that there is probably no
good reason for the ordering that we have, and what's more, there may
be good arguments even outside this case for DSM segments being cleaned
up as late as possible, now that they contain shared control
information and not just tuple data as once had been imagined. I can't
think of any reason why this would not be safe. Can you?

5. Implemented the empty inner relation optimisation.

Some smaller changes and miles of feedback inline below:

On Mon, Mar 27, 2017 at 11:03 AM, Thomas Munro wrote:
> On Mon, Mar 27, 2017 at 9:41 AM, Andres Freund wrote:
>> SharedBufFile allows temporary files to be created by one backend and
>> then exported for read-only access by other backends, with clean-up
>> managed by reference counting associated with a DSM segment. This
>> includes changes to fd.c and buffile.c to support new kinds of
>> temporary file.
>>
>> diff --git a/src/backend/storage/file/buffile.c
>> b/src/backend/storage/file/buffile.c
>> index 4ca0ea4..a509c05 100644
>> --- a/src/backend/storage/file/buffile.c
>> +++ b/src/backend/storage/file/buffile.c
>>
>> I think the new facilities should be explained in the file's header.
>
> Will do.

Done.

>> @@ -68,9 +71,10 @@ struct BufFile
>>  * avoid making redundant FileSeek calls.
>>  */
>>
>> -    bool isTemp;       /* can only add files if this is TRUE */
>> +    bool isSegmented;  /* can only add files if this is TRUE */
>>
>> That's a bit of a weird and uncommented upon change.
>
> I was trying to cut down on the number of places we use the word
> 'temporary' to activate various different behaviours. In this case,
> the only thing it controls is whether the BufFile is backed by one
> single fd.c File or many segments, so I figured it should be renamed.
> As Peter and you have pointed out, there may be a case for removing it
> altogether.

Done in 0007-hj-remove-buf-file-is-temp-v11.patch.

>> @@ -79,6 +83,8 @@ struct BufFile
>>  */
>>  ResourceOwner resowner;
>>
>> +    BufFileTag tag;    /* for discoverability between backends */
>>
>> Not perfectly happy with the name tag here, the name is a bit too
>> similar to BufferTag - something quite different.
>
> Yeah, will rename.

Done. That existed only because I had sharedbuffile.c which needed
special access to buffile.c via those weird 'tag' interfaces. In the
new version that isn't required, and a new struct BufFileSet is
provided by buffile.c/h.

>> +static void
>> +make_tagged_path(char *tempdirpath, char *tempfilepath,
>> +                 const BufFileTag *tag, int segment)
>> +{
>> +    if (tag->tablespace == DEFAULTTABLESPACE_OID ||
>> +
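[For reference, the 'qunique' helper mentioned earlier in this message (basically a poor man's std::unique) can be sketched as below. This is an illustrative reconstruction with a qsort-style signature, not the exact code from the patch set; the comparator is included only for demonstration.]

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/*
 * Illustrative sketch of a qunique-style helper: remove adjacent
 * duplicates from a sorted array in place and return the new number
 * of elements, in the spirit of C++ std::unique.  Signature and
 * names are assumptions modeled on qsort(), not the patch's code.
 */
size_t
qunique(void *array, size_t elements, size_t width,
        int (*compare) (const void *, const void *))
{
    char       *bytes = (char *) array;
    size_t      i,
                j;

    if (elements <= 1)
        return elements;

    for (i = 1, j = 0; i < elements; i++)
    {
        if (compare(bytes + i * width, bytes + j * width) != 0)
        {
            j++;
            if (j != i)
                memcpy(bytes + j * width, bytes + i * width, width);
        }
    }
    return j + 1;
}

/* Example comparator for int arrays. */
int
compare_ints(const void *a, const void *b)
{
    int         x = *(const int *) a;
    int         y = *(const int *) b;

    return (x > y) - (x < y);
}
```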
Re: [HACKERS] WIP: [[Parallel] Shared] Hash
On Tue, Mar 28, 2017 at 11:11 AM, Rafia Sabih wrote:
> On Mon, Mar 27, 2017 at 12:20 PM, Thomas Munro wrote:
>> On Sun, Mar 26, 2017 at 3:56 PM, Thomas Munro wrote:
>>> But... what you said above must be a problem for Windows. I believe
>>> it doesn't allow files to be unlinked if they are open, and I see that
>>> DSM segments are cleaned up in resowner's phase ==
>>> RESOURCE_RELEASE_BEFORE_LOCKS and files are closed in phase ==
>>> RESOURCE_RELEASE_AFTER_LOCKS.
>>
>> I thought this last point about Windows might be fatal to my design,
>> but it seems that Windows since at least version 2000 has support for
>> Unixoid unlinkability via the special flag FILE_SHARE_DELETE.
>
> On testing v10 of this patch over commit
> b54aad8e34bd6299093e965c50f4a23da96d7cc3 and applying the tweak
> mentioned in [1], for TPC-H queries I found the results quite
> encouraging.
>
> Experimental setup:
> TPC-H scale factor - 20
> work_mem = 1GB
> shared_buffers = 10GB
> effective_cache_size = 10GB
> random_page_cost = seq_page_cost = 0.1
> max_parallel_workers_per_gather = 4
>
> Performance numbers (time in seconds):
>
> Query | Head | Patch
> ------+------+------
> Q3    |   73 |    37
> Q5    |   56 |    31
> Q7    |   40 |    30
> Q8    |    8 |     8
> Q9    |   85 |    42
> Q10   |   86 |    46
> Q14   |   11 |     6
> Q16   |   32 |    11
> Q21   |   53 |    56
>
> Please find the attached file for the explain analyse output of these
> queries on head as well as patch. Would be working on analysing the
> performance of this patch on 300 scale factor.
>
> [1] https://www.postgresql.org/message-id/flat/CAEepm%3D270ze2hVxWkJw-5eKzc3AB4C9KpH3L2kih75R5pdSogg%40mail.gmail.com

Before moving to a higher scale factor I tried playing around with the
effect of work_mem on this patch and came across the following results.
All settings are kept as before, with the exception of work_mem, which
is set to 64MB.
Most of the queries showed similar performance except a few; details
are as follows (all times in ms):

Query |   Head |  Patch
------+--------+-------
Q8    |   8720 |   8839
Q18   | 370710 | 384347
Q21   |  53270 |  65189

Clearly, the regression in Q8 and Q18 is minor, but that in Q21 is
significant. Just to confirm, I have applied the tweak mentioned in [1]
as before. For the explain analyse output of Q21 on head and with the
patch, please check the attached file.

[1] https://www.postgresql.org/message-id/flat/CAEepm%3D270ze2hVxWkJw-5eKzc3AB4C9KpH3L2kih75R5pdSogg%40mail.gmail.com

--
Regards,
Rafia Sabih
EnterpriseDB: http://www.enterprisedb.com/

Attachment: q21_reg_ph.tar.gz (GNU Zip compressed data)
Re: [HACKERS] WIP: [[Parallel] Shared] Hash
Hi, On 2017-03-27 22:33:03 -0700, Andres Freund wrote: > On 2017-03-23 20:35:09 +1300, Thomas Munro wrote: > > Here is a new patch series responding to feedback from Peter and Andres: > > Here's a review of 0007 & 0010 together - they're going to have to be > applied together anyway... > ... > ok, ENOTIME for today... Continuing, where I dropped of tiredly yesterday. - ExecHashJoinSaveTuple(tuple, - hashvalue, - >innerBatchFile[batchno]); + if (HashJoinTableIsShared(hashtable)) + sts_puttuple(hashtable->shared_inner_batches, batchno, , +tuple); + else + ExecHashJoinSaveTuple(tuple, + hashvalue, + >innerBatchFile[batchno]); } } Why isn't this done inside of ExecHashJoinSaveTuple? @@ -1280,6 +1785,68 @@ ExecHashTableReset(HashJoinTable hashtable) + /* Rewind the shared read heads for this batch, inner and outer. */ + sts_prepare_parallel_read(hashtable->shared_inner_batches, + curbatch); + sts_prepare_parallel_read(hashtable->shared_outer_batches, + curbatch); It feels somewhat wrong to do this in here, rather than on the callsites. + } + + /* +* Each participant needs to make sure that data it has written for +* this partition is now read-only and visible to other participants. +*/ + sts_end_write(hashtable->shared_inner_batches, curbatch); + sts_end_write(hashtable->shared_outer_batches, curbatch); + + /* +* Wait again, so that all workers see the new hash table and can +* safely read from batch files from any participant because they have +* all ended writing. +*/ + Assert(BarrierPhase(>shared->barrier) == + PHJ_PHASE_RESETTING_BATCH(curbatch)); + BarrierWait(>shared->barrier, WAIT_EVENT_HASH_RESETTING); + Assert(BarrierPhase(>shared->barrier) == + PHJ_PHASE_LOADING_BATCH(curbatch)); + ExecHashUpdate(hashtable); + + /* Forget the current chunks. 
*/ + hashtable->current_chunk = NULL; + return; + } /* * Release all the hash buckets and tuples acquired in the prior pass, and @@ -1289,10 +1856,10 @@ ExecHashTableReset(HashJoinTable hashtable) oldcxt = MemoryContextSwitchTo(hashtable->batchCxt); /* Reallocate and reinitialize the hash bucket headers. */ - hashtable->buckets = (HashJoinTuple *) - palloc0(nbuckets * sizeof(HashJoinTuple)); + hashtable->buckets = (HashJoinBucketHead *) + palloc0(nbuckets * sizeof(HashJoinBucketHead)); - hashtable->spaceUsed = nbuckets * sizeof(HashJoinTuple); + hashtable->spaceUsed = nbuckets * sizeof(HashJoinBucketHead); /* Cannot be more than our previous peak; we had this size before. */ Assert(hashtable->spaceUsed <= hashtable->spacePeak); @@ -1301,6 +1868,22 @@ ExecHashTableReset(HashJoinTable hashtable) /* Forget the chunks (the memory was freed by the context reset above). */ hashtable->chunks = NULL; + + /* Rewind the shared read heads for this batch, inner and outer. */ + if (hashtable->innerBatchFile[curbatch] != NULL) + { + if (BufFileSeek(hashtable->innerBatchFile[curbatch], 0, 0L, SEEK_SET)) + ereport(ERROR, + (errcode_for_file_access(), + errmsg("could not rewind hash-join temporary file: %m"))); + } + if (hashtable->outerBatchFile[curbatch] != NULL) + { + if (BufFileSeek(hashtable->outerBatchFile[curbatch], 0, 0L, SEEK_SET)) + ereport(ERROR, + (errcode_for_file_access(), + errmsg("could not rewind hash-join temporary file: %m"))); + } } /* @@ -1310,12 +1893,21 @@ ExecHashTableReset(HashJoinTable hashtable) void ExecHashTableResetMatchFlags(HashJoinTable hashtable) { + dsa_pointer chunk_shared = InvalidDsaPointer; HashMemoryChunk chunk; HashJoinTuple tuple; int i; /* Reset all flags in the main table ... */ - chunk = hashtable->chunks; + if
Re: [HACKERS] WIP: [[Parallel] Shared] Hash
Hi Thomas,

On 3/28/17 1:41 AM, Rafia Sabih wrote:
> On Mon, Mar 27, 2017 at 12:20 PM, Thomas Munro wrote:
>> I thought this last point about Windows might be fatal to my design,
>> but it seems that Windows since at least version 2000 has support for
>> Unixoid unlinkability via the special flag FILE_SHARE_DELETE.
> <...>
> Please find the attached file for the explain analyse output of these
> queries on head as well as patch. Would be working on analysing the
> performance of this patch on 300 scale factor.

I have marked this submission "Waiting for Author". A new patch is
needed to address Andres' comments and you should have a look at
Rafia's results.

Thanks,
--
-David
da...@pgmasters.net
Re: [HACKERS] WIP: [[Parallel] Shared] Hash
On 2017-03-23 20:35:09 +1300, Thomas Munro wrote: > Here is a new patch series responding to feedback from Peter and Andres: Here's a review of 0007 & 0010 together - they're going to have to be applied together anyway... diff --git a/doc/src/sgml/config.sgml b/doc/src/sgml/config.sgml index ac339fb566..775c9126c7 100644 --- a/doc/src/sgml/config.sgml +++ b/doc/src/sgml/config.sgml @@ -3814,6 +3814,21 @@ ANY num_sync ( + cpu_shared_tuple_cost (floating point) + + cpu_shared_tuple_cost configuration parameter + + + + +Sets the planner's estimate of the cost of sharing rows in +memory during a parallel query. +The default is 0.001. + + + + Isn't that really low in comparison to the other costs? I think specifying a bit more what this actually measures would be good too - is it putting the tuple in shared memory? Is it accessing it? + + cpu_synchronization_cost (floating point) + + cpu_synchronization_cost configuration parameter + + + + +Sets the planner's estimate of the cost of waiting at synchronization +points for other processes while executing parallel queries. +The default is 1.0. + + + Isn't this also really cheap in comparison to a, probably cached, seq page read? + if (HashJoinTableIsShared(hashtable)) + { + /* +* Synchronize parallel hash table builds. At this stage we know that +* the shared hash table has been created, but we don't know if our +* peers are still in MultiExecHash and if so how far through. We use +* the phase to synchronize with them. +*/ + barrier = >shared->barrier; + + switch (BarrierPhase(barrier)) + { + case PHJ_PHASE_BEGINNING: Note pgindent will indent this further. Might be worthwhile to try to pgindent the file, revert some of the unintended damage. /* * set expression context */ I'd still like this to be moved to the start. 
@@ -126,17 +202,79 @@ MultiExecHash(HashState *node) /* Not subject to skew optimization, so insert normally */ ExecHashTableInsert(hashtable, slot, hashvalue); } - hashtable->totalTuples += 1; + hashtable->partialTuples += 1; + if (!HashJoinTableIsShared(hashtable)) + hashtable->totalTuples += 1; } } FWIW, I'd put HashJoinTableIsShared() into a local var - the compiler won't be able to do that on its own because external function calls could invalidate the result. That brings me to a related topic: Have you measured whether your changes cause performance differences? + finish_loading(hashtable); I find the sudden switch to a different naming scheme in the same file a bit jarring. + if (HashJoinTableIsShared(hashtable)) + BarrierDetach(>shared->shrink_barrier); + + if (HashJoinTableIsShared(hashtable)) + { Consecutive if blocks with the same condition... + bool elected_to_resize; + + /* +* Wait for all backends to finish building. If only one worker is +* running the building phase because of a non-partial inner plan, the +* other workers will pile up here waiting. If multiple worker are +* building, they should finish close to each other in time. +*/ That comment is outdated, isn't it? /* resize the hash table if needed (NTUP_PER_BUCKET exceeded) */ - if (hashtable->nbuckets != hashtable->nbuckets_optimal) - ExecHashIncreaseNumBuckets(hashtable); + ExecHashUpdate(hashtable); + ExecHashIncreaseNumBuckets(hashtable); So this now doesn't actually increase the number of buckets anymore. + reinsert: + /* If the table was resized, insert tuples into the new buckets. */ + ExecHashUpdate(hashtable); + ExecHashReinsertAll(hashtable); ReinsertAll just happens to do nothing if we didn't have to resize... Not entirely obvious, sure reads as if it were unconditional. Also, it's not actually "All" when batching is in use, no? 
+ post_resize: + if (HashJoinTableIsShared(hashtable)) + { + Assert(BarrierPhase(barrier) == PHJ_PHASE_RESIZING); + BarrierWait(barrier, WAIT_EVENT_HASH_RESIZING); + Assert(BarrierPhase(barrier) == PHJ_PHASE_REINSERTING); + } + + reinsert: + /* If the table was resized, insert tuples into the new buckets. */ + ExecHashUpdate(hashtable); + ExecHashReinsertAll(hashtable); Hm. So even non-resizing backends reach this - but they happen to not do anything
Re: [HACKERS] WIP: [[Parallel] Shared] Hash
On Sun, Mar 26, 2017 at 3:56 PM, Thomas Munro wrote:
> But... what you said above must be a problem for Windows. I believe
> it doesn't allow files to be unlinked if they are open, and I see that
> DSM segments are cleaned up in resowner's phase ==
> RESOURCE_RELEASE_BEFORE_LOCKS and files are closed in phase ==
> RESOURCE_RELEASE_AFTER_LOCKS.

I thought this last point about Windows might be fatal to my design,
but it seems that Windows since at least version 2000 has support for
Unixoid unlinkability via the special flag FILE_SHARE_DELETE.

--
Thomas Munro
http://www.enterprisedb.com
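[The Unix semantics being relied on here - a file's name can be removed while a descriptor still holds it open, and the data stays accessible until the last descriptor is closed - can be demonstrated in a few lines. The function name is illustrative; per the message above, FILE_SHARE_DELETE is what reportedly opts Windows into equivalent behaviour.]

```c
#include <fcntl.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

/*
 * Demonstrates the POSIX unlink-while-open semantics discussed above.
 * Returns 1 if the unlinked file's contents remained readable through
 * the still-open descriptor, 0 on any failure.
 */
int
demo_unlink_keeps_data(void)
{
    char        path[] = "/tmp/unlink_demo_XXXXXX";
    int         fd = mkstemp(path);
    char        buf[6] = {0};

    if (fd < 0)
        return 0;
    if (write(fd, "hello", 5) != 5)
        return 0;
    if (unlink(path) != 0)          /* remove the name... */
        return 0;
    if (access(path, F_OK) == 0)    /* ...the name is now gone... */
        return 0;
    if (lseek(fd, 0, SEEK_SET) != 0)
        return 0;
    if (read(fd, buf, 5) != 5)      /* ...but the open fd still works */
        return 0;
    close(fd);
    return strcmp(buf, "hello") == 0;
}
```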
Re: [HACKERS] WIP: [[Parallel] Shared] Hash
On Sun, Mar 26, 2017 at 6:50 PM, Thomas Munro wrote:
> Like you, I also tend to suspect that people would be more likely to
> use RAID type technologies to stripe things like this for both
> bandwidth and space reasons these days. Tablespaces seem to make more
> sense as a way of separating different classes of storage
> (fast/expensive, slow/cheap etc), not as an IO or space striping
> technique.

I agree.

--
Peter Geoghegan
Re: [HACKERS] WIP: [[Parallel] Shared] Hash
On Mon, Mar 27, 2017 at 12:12 PM, Peter Geoghegan wrote:
> On Sun, Mar 26, 2017 at 3:41 PM, Thomas Munro wrote:
>>> 1. Segments are what buffile.c already calls the individual
>>> capped-at-1GB files that it manages. They are an implementation
>>> detail that is not part of buffile.c's user interface. There seems to
>>> be no reason to change that.
>>
>> After reading your next email I realised this is not quite true:
>> BufFileTell and BufFileSeek expose the existence of segments.
>
> Yeah, that's something that tuplestore.c itself relies on.
>
> I always thought that the main practical reason why we have BufFile
> multiplex 1GB segments concerns use of temp_tablespaces, rather than
> considerations that matter only when using obsolete file systems:
>
> /*
>  * We break BufFiles into gigabyte-sized segments, regardless of RELSEG_SIZE.
>  * The reason is that we'd like large temporary BufFiles to be spread across
>  * multiple tablespaces when available.
>  */
>
> Now, I tend to think that most installations that care about
> performance would be better off using RAID to stripe their one temp
> tablespace file system. But, I suppose this still makes sense when you
> have a number of file systems that happen to be available, and disk
> capacity is the main concern. PHJ uses one temp tablespace per worker,
> which I further suppose might not be as effective in balancing disk
> space usage.

I was thinking about IO bandwidth balance rather than size. If you
rotate through tablespaces segment-by-segment, won't you be exposed to
phasing effects that could leave disk arrays idle for periods of time?
Whereas if you assign them to participants, you can only get idle
arrays if you have fewer participants than tablespaces.

This seems like a fairly complex subtopic and I don't have a strong
view on it. Clearly you could rotate through tablespaces on the basis
of participant, partition, segment, some combination, or something
else.
Doing it by participant seemed to me to be the least prone to IO
imbalance caused by phasing effects (= segment based) or data
distribution (= partition based), of the options I considered when I
wrote it that way.

Like you, I also tend to suspect that people would be more likely to
use RAID type technologies to stripe things like this for both
bandwidth and space reasons these days. Tablespaces seem to make more
sense as a way of separating different classes of storage
(fast/expensive, slow/cheap etc), not as an IO or space striping
technique. I may be way off base there though...

--
Thomas Munro
http://www.enterprisedb.com
Re: [HACKERS] WIP: [[Parallel] Shared] Hash
On Sun, Mar 26, 2017 at 3:41 PM, Thomas Munro wrote: >> 1. Segments are what buffile.c already calls the individual >> capped-at-1GB files that it manages. They are an implementation >> detail that is not part of buffile.c's user interface. There seems to >> be no reason to change that. > > After reading your next email I realised this is not quite true: > BufFileTell and BufFileSeek expose the existence of segments. Yeah, that's something that tuplestore.c itself relies on. I always thought that the main practical reason why we have BufFile multiplex 1GB segments concerns use of temp_tablespaces, rather than considerations that matter only when using obsolete file systems: /* * We break BufFiles into gigabyte-sized segments, regardless of RELSEG_SIZE. * The reason is that we'd like large temporary BufFiles to be spread across * multiple tablespaces when available. */ Now, I tend to think that most installations that care about performance would be better off using RAID to stripe their one temp tablespace file system. But, I suppose this still makes sense when you have a number of file systems that happen to be available, and disk capacity is the main concern. PHJ uses one temp tablespace per worker, which I further suppose might not be as effective in balancing disk space usage. -- Peter Geoghegan
Re: [HACKERS] WIP: [[Parallel] Shared] Hash
On Mon, Mar 27, 2017 at 11:03 AM, Thomas Munro wrote: >> Is there a risk that this ends up running afoul of filename length >> limits on some platforms? > > Hmm. I didn't think so. Do we have a project guideline on maximum > path lengths based on some kind of survey? There are various limits > involved (filesystem and OS per-path-component limits, total limits, > and the confusing PATH_MAX, MAX_PATH etc macros), but I was under the > impression that these numbers were always at least 255. This scheme > seems capable of producing ~50 bytes in the final component > (admittedly more if int is 64 bits), and then nowhere near enough to > reach a limit of that order in the earlier components. Err, plus prefix. Still seems unlikely to be too long. >> I'm a bit unhappy with the partition terminology around this. It's >> getting a bit confusing. We have partitions, participants and >> segments. Most of them could be understood for something entirely >> different than the meaning you have here... > > Ok. Let me try to explain and defend them and see if we can come up > with something better. > > 1. Segments are what buffile.c already calls the individual > capped-at-1GB files that it manages. They are an implementation > detail that is not part of buffile.c's user interface. There seems to > be no reason to change that. After reading your next email I realised this is not quite true: BufFileTell and BufFileSeek expose the existence of segments. -- Thomas Munro http://www.enterprisedb.com
Re: [HACKERS] WIP: [[Parallel] Shared] Hash
On Mon, Mar 27, 2017 at 9:41 AM, Andres Freund wrote: > Hi, > > > SharedBufFile allows temporary files to be created by one backend and > then exported for read-only access by other backends, with clean-up > managed by reference counting associated with a DSM segment. This includes > changes to fd.c and buffile.c to support new kinds of temporary file. > > > diff --git a/src/backend/storage/file/buffile.c > b/src/backend/storage/file/buffile.c > index 4ca0ea4..a509c05 100644 > --- a/src/backend/storage/file/buffile.c > +++ b/src/backend/storage/file/buffile.c > > I think the new facilities should be explained in the file's header. Will do. > @@ -68,9 +71,10 @@ struct BufFile > * avoid making redundant FileSeek calls. > */ > > - bool isTemp; /* can only add files if this > is TRUE */ > + bool isSegmented; /* can only add files if this is TRUE > */ > > That's a bit of a weird and uncommented upon change. I was trying to cut down on the number of places we use the word 'temporary' to activate various different behaviours. In this case, the only thing it controls is whether the BufFile is backed by one single fd.c File or many segments, so I figured it should be renamed. As Peter and you have pointed out, there may be a case for removing it altogether. > @@ -79,6 +83,8 @@ struct BufFile > */ > ResourceOwner resowner; > > + BufFileTag tag; /* for discoverability > between backends */ > > Not perfectly happy with the name tag here, the name is a bit too > similar to BufferTag - something quite different. Yeah, will rename. 
> +static void > +make_tagged_path(char *tempdirpath, char *tempfilepath, > + const BufFileTag *tag, int segment) > +{ > + if (tag->tablespace == DEFAULTTABLESPACE_OID || > + tag->tablespace == GLOBALTABLESPACE_OID) > + snprintf(tempdirpath, MAXPGPATH, "base/%s", > PG_TEMP_FILES_DIR); > + else > + { > + snprintf(tempdirpath, MAXPGPATH, "pg_tblspc/%u/%s/%s", > + tag->tablespace, > TABLESPACE_VERSION_DIRECTORY, > + PG_TEMP_FILES_DIR); > + } > + > + snprintf(tempfilepath, MAXPGPATH, "%s/%s%d.%d.%d.%d.%d", tempdirpath, > + PG_TEMP_FILE_PREFIX, > + tag->creator_pid, tag->set, tag->partition, > tag->participant, > + segment); > > Is there a risk that this ends up running afoul of filename length > limits on some platforms? Hmm. I didn't think so. Do we have a project guideline on maximum path lengths based on some kind of survey? There are various limits involved (filesystem and OS per-path-component limits, total limits, and the confusing PATH_MAX, MAX_PATH etc macros), but I was under the impression that these numbers were always at least 255. This scheme seems capable of producing ~50 bytes in the final component (admittedly more if int is 64 bits), and then nowhere near enough to reach a limit of that order in the earlier components. > +} > + > +static File > +make_tagged_segment(const BufFileTag *tag, int segment) > +{ > + File file; > + char tempdirpath[MAXPGPATH]; > + char tempfilepath[MAXPGPATH]; > + > + /* > + * There is a remote chance that disk files with this (pid, set) pair > + * already exist after a crash-restart. Since the presence of > + * consecutively numbered segment files is used by BufFileOpenShared > to > + * determine the total size of a shared BufFile, we'll defend against > + * confusion by unlinking segment 1 (if it exists) before creating > segment > + * 0. > + */ > > Gah. Why on earth aren't we removing temp files when restarting, not > just on the initial start? That seems completely wrong? See the comment above RemovePgTempFiles in fd.c. 
From comments on this list I understand that this is a subject that Robert and Tom don't agree on. I don't mind either way, but as long as RemovePgTempFiles works that way and my patch uses the existence of files to know how many files there are, I have to defend against that danger by making sure that I don't accidentally identify files from before a crash/restart as active. > If we do decide not to change this: Why is that sufficient? Doesn't the > same problem exist for segments later than the first? It does exist and it is handled. The comment really should say "unlinking segment N + 1 (if it exists) before creating segment N". Will update. > +/* > + * Open a file that was previously created in another backend with > + * BufFileCreateShared. > + */ > +BufFile * > +BufFileOpenTagged(const BufFileTag *tag) > +{ > + BufFile *file = (BufFile *) palloc(sizeof(BufFile)); > + char tempdirpath[MAXPGPATH]; > +
Re: [HACKERS] WIP: [[Parallel] Shared] Hash
On 2017-03-23 20:35:09 +1300, Thomas Munro wrote: > Here is a new patch series responding to feedback from Peter and Andres: + +/* Per-participant shared state. */ +typedef struct SharedTuplestoreParticipant +{ + LWLock lock; Hm. No padding (ala LWLockMinimallyPadded / LWLockPadded) - but that's probably ok, for now. + bool error; /* Error occurred flag. */ + bool eof; /* End of file reached. */ + int read_fileno; /* BufFile segment file number. */ + off_t read_offset; /* Offset within segment file. */ Hm. I wonder if it'd not be better to work with 64bit offsets, and just separate that out upon segment access. +/* The main data structure in shared memory. */ "main data structure" isn't particularly meaningful. +struct SharedTuplestore +{ + int reading_partition; + int nparticipants; + int flags; Maybe add a comment saying /* flag bits from SHARED_TUPLESTORE_* */? + Size meta_data_size; What's this? + SharedTuplestoreParticipant participants[FLEXIBLE_ARRAY_MEMBER]; I'd add a comment here, that there's further data after participants. +}; + +/* Per-participant backend-private state. */ +struct SharedTuplestoreAccessor +{ Hm. The name and it being backend-local are a bit conflicting. + int participant; /* My participant number. */ + SharedTuplestore *sts; /* The shared state. */ + int nfiles; /* Size of local files array. */ + BufFile **files; /* Files we have open locally for writing. */ Shouldn't this mention that it's indexed by partition? + BufFile *read_file; /* The current file to read from. */ + int read_partition; /* The current partition to read from. */ + int read_participant; /* The current participant to read from. */ + int read_fileno; /* BufFile segment file number. */ + off_t read_offset; /* Offset within segment file. */ +}; +/* + * Initialize a SharedTuplestore in existing shared memory. There must be + * space for sts_size(participants) bytes. 
If flags is set to the value + * SHARED_TUPLESTORE_SINGLE_PASS then each partition may only be read once, + * because underlying files will be deleted. Any reason not to use flags that are compatible with tuplestore.c? + * Tuples that are stored may optionally carry a piece of fixed sized + * meta-data which will be retrieved along with the tuple. This is useful for + * the hash codes used for multi-batch hash joins, but could have other + * applications. + */ +SharedTuplestoreAccessor * +sts_initialize(SharedTuplestore *sts, int participants, + int my_participant_number, + Size meta_data_size, + int flags, + dsm_segment *segment) +{ Not sure I like that the naming here has little in common with tuplestore.h's api. + +MinimalTuple +sts_gettuple(SharedTuplestoreAccessor *accessor, void *meta_data) +{ This needs docs. + SharedBufFileSet *fileset = GetSharedBufFileSet(accessor->sts); + MinimalTuple tuple = NULL; + + for (;;) + { ... + /* Check if this participant's file has already been entirely read. */ + if (participant->eof) + { + BufFileClose(accessor->read_file); + accessor->read_file = NULL; + LWLockRelease(&participant->lock); + continue; Why are we closing the file while holding the lock? + + /* Read the optional meta-data. */ + eof = false; + if (accessor->sts->meta_data_size > 0) + { + nread = BufFileRead(accessor->read_file, meta_data, + accessor->sts->meta_data_size); + if (nread == 0) + eof = true; + else if (nread != accessor->sts->meta_data_size) + ereport(ERROR, + (errcode_for_file_access(), + errmsg("could not read from temporary file: %m"))); + } + + /* Read the size. */ + if (!eof) + { + nread = BufFileRead(accessor->read_file, &tuple_size, sizeof(tuple_size)); + if (nread == 0) + eof = true; Why is it legal to have EOF here, if metadata previously didn't have an EOF? Perhaps add an error if accessor->sts->meta_data_size != 0? + if (eof) + { +
Re: [HACKERS] WIP: [[Parallel] Shared] Hash
Hi, SharedBufFile allows temporary files to be created by one backend and then exported for read-only access by other backends, with clean-up managed by reference counting associated with a DSM segment. This includes changes to fd.c and buffile.c to support new kinds of temporary file. diff --git a/src/backend/storage/file/buffile.c b/src/backend/storage/file/buffile.c index 4ca0ea4..a509c05 100644 --- a/src/backend/storage/file/buffile.c +++ b/src/backend/storage/file/buffile.c I think the new facilities should be explained in the file's header. @@ -68,9 +71,10 @@ struct BufFile * avoid making redundant FileSeek calls. */ - bool isTemp; /* can only add files if this is TRUE */ + bool isSegmented; /* can only add files if this is TRUE */ That's a bit of a weird and uncommented upon change. @@ -79,6 +83,8 @@ struct BufFile */ ResourceOwner resowner; + BufFileTag tag; /* for discoverability between backends */ Not perfectly happy with the name tag here, the name is a bit too similar to BufferTag - something quite different. +static void +make_tagged_path(char *tempdirpath, char *tempfilepath, + const BufFileTag *tag, int segment) +{ + if (tag->tablespace == DEFAULTTABLESPACE_OID || + tag->tablespace == GLOBALTABLESPACE_OID) + snprintf(tempdirpath, MAXPGPATH, "base/%s", PG_TEMP_FILES_DIR); + else + { + snprintf(tempdirpath, MAXPGPATH, "pg_tblspc/%u/%s/%s", + tag->tablespace, TABLESPACE_VERSION_DIRECTORY, + PG_TEMP_FILES_DIR); + } + + snprintf(tempfilepath, MAXPGPATH, "%s/%s%d.%d.%d.%d.%d", tempdirpath, + PG_TEMP_FILE_PREFIX, + tag->creator_pid, tag->set, tag->partition, tag->participant, + segment); Is there a risk that this ends up running afoul of filename length limits on some platforms? +} + +static File +make_tagged_segment(const BufFileTag *tag, int segment) +{ + File file; + char tempdirpath[MAXPGPATH]; + char tempfilepath[MAXPGPATH]; + + /* + * There is a remote chance that disk files with this (pid, set) pair + * already exist after a crash-restart. 
Since the presence of + * consecutively numbered segment files is used by BufFileOpenShared to + * determine the total size of a shared BufFile, we'll defend against + * confusion by unlinking segment 1 (if it exists) before creating segment + * 0. + */ Gah. Why on earth aren't we removing temp files when restarting, not just on the initial start? That seems completely wrong? If we do decide not to change this: Why is that sufficient? Doesn't the same problem exist for segments later than the first? +/* + * Open a file that was previously created in another backend with + * BufFileCreateShared. + */ +BufFile * +BufFileOpenTagged(const BufFileTag *tag) +{ + BufFile *file = (BufFile *) palloc(sizeof(BufFile)); + char tempdirpath[MAXPGPATH]; + char tempfilepath[MAXPGPATH]; + Size capacity = 1024; + File *files = palloc(sizeof(File) * capacity); + int nfiles = 0; + + /* + * We don't know how many segments there are, so we'll probe the + * filesystem to find out. + */ + for (;;) + { + /* See if we need to expand our file space. */ + if (nfiles + 1 > capacity) + { + capacity *= 2; + files = repalloc(files, sizeof(File) * capacity); + } + /* Try to load a segment. */ + make_tagged_path(tempdirpath, tempfilepath, tag, nfiles); + files[nfiles] = PathNameOpenTemporaryFile(tempfilepath); + if (files[nfiles] <= 0) + break; Isn't 0 a theoretically valid return value from PathNameOpenTemporaryFile? +/* + * Delete a BufFile that was created by BufFileCreateTagged. Return true if + * at least one segment was deleted; false indicates that no segment was + * found, or an error occurred while trying to delete. Errors are logged but + * the function returns normally because this is assumed to run in a clean-up + * path that might already involve an error. 
+ */ +bool +BufFileDeleteTagged(const BufFileTag *tag) +{ + char tempdirpath[MAXPGPATH]; + char tempfilepath[MAXPGPATH]; + int segment = 0; + bool found = false; + + /* + * We don't know if the BufFile really exists, because SharedBufFile + * tracks only the range of file numbers. If it does exist, we don't + * know many
Re: [HACKERS] WIP: [[Parallel] Shared] Hash
On Sat, Mar 25, 2017 at 7:56 PM, Thomas Munro wrote: > On Sun, Mar 26, 2017 at 1:53 PM, Peter Geoghegan wrote: >> ISTM that your patch now shares a quality with parallel tuplesort: You >> may now hold files open after an unlink() of the original link/path >> that they were opened using. As Robert pointed out when discussing >> parallel tuplesort earlier in the week, that comes with the risk, >> however small, that the vFD cache will close() the file out from under >> us during LRU maintenance, resulting in a subsequent open() (at the >> tail-end of the vFD's lifetime) that fails unexpectedly. It's probably >> fine to assume that we can sanely close() the file ourselves in fd.c >> error paths despite a concurrent unlink(), since we never operate on >> the link itself, and there probably isn't much pressure on each >> backend's vFD cache. But, is that good enough? I can't say, though I >> suspect that this particular risk is one that's best avoided. >> >> I haven't tested out how much of a problem this might be for your >> patch, but I do know that resowner.c will call your shared mem segment >> callback before closing any backend local vFDs, so I can't imagine how >> it could be that this risk doesn't exist. > > I wouldn't have expected anything like that to be a problem, because > FileClose() doesn't call FileAccess(). So IIUC it wouldn't ever try > to reopen a kernel fd just to close it. The concern is that something somewhere does. For example, mdread() calls FileRead(), which calls FileAccess(), ultimately because of some obscure catalog access. It's very hard to reason about things like that. -- Peter Geoghegan
Re: [HACKERS] WIP: [[Parallel] Shared] Hash
On Sun, Mar 26, 2017 at 1:53 PM, Peter Geoghegan wrote: > ISTM that your patch now shares a quality with parallel tuplesort: You > may now hold files open after an unlink() of the original link/path > that they were opened using. As Robert pointed out when discussing > parallel tuplesort earlier in the week, that comes with the risk, > however small, that the vFD cache will close() the file out from under > us during LRU maintenance, resulting in a subsequent open() (at the > tail-end of the vFD's lifetime) that fails unexpectedly. It's probably > fine to assume that we can sanely close() the file ourselves in fd.c > error paths despite a concurrent unlink(), since we never operate on > the link itself, and there probably isn't much pressure on each > backend's vFD cache. But, is that good enough? I can't say, though I > suspect that this particular risk is one that's best avoided. > > I haven't tested out how much of a problem this might be for your > patch, but I do know that resowner.c will call your shared mem segment > callback before closing any backend local vFDs, so I can't imagine how > it could be that this risk doesn't exist. I wouldn't have expected anything like that to be a problem, because FileClose() doesn't call FileAccess(). So IIUC it wouldn't ever try to reopen a kernel fd just to close it. But... what you said above must be a problem for Windows. I believe it doesn't allow files to be unlinked if they are open, and I see that DSM segments are cleaned up in resowner's phase == RESOURCE_RELEASE_BEFORE_LOCKS and files are closed in phase == RESOURCE_RELEASE_AFTER_LOCKS. Hmm. -- Thomas Munro http://www.enterprisedb.com
Re: [HACKERS] WIP: [[Parallel] Shared] Hash
On Thu, Mar 23, 2017 at 12:35 AM, Thomas Munro wrote: > Thanks. I really appreciate your patience with the resource > management stuff I had failed to think through. It's a surprisingly difficult problem that almost requires prototyping just to explain. No need to apologize. This is the process by which many hard problems end up being solved. >> However, I notice that the place that this happens instead, >> PathNameDelete(), does not repeat the fd.c step of doing a final >> stat(), and using the stats for a pgstat_report_tempfile(). So, a new >> pgstat_report_tempfile() call is simply missing. However, the more >> fundamental issue in my mind is: How can you fix that? Where would it >> go if you had it? > > You're right. I may be missing something here (again), but it does > seem straightforward to implement because we always delete each file > that really exists exactly once (and sometimes we also try to delete > files that don't exist due to imprecise meta-data, but that isn't > harmful and we know when that turns out to be the case). ISTM that your patch now shares a quality with parallel tuplesort: You may now hold files open after an unlink() of the original link/path that they were opened using. As Robert pointed out when discussing parallel tuplesort earlier in the week, that comes with the risk, however small, that the vFD cache will close() the file out from under us during LRU maintenance, resulting in a subsequent open() (at the tail-end of the vFD's lifetime) that fails unexpectedly. It's probably fine to assume that we can sanely close() the file ourselves in fd.c error paths despite a concurrent unlink(), since we never operate on the link itself, and there probably isn't much pressure on each backend's vFD cache. But, is that good enough? I can't say, though I suspect that this particular risk is one that's best avoided. 
I haven't tested out how much of a problem this might be for your patch, but I do know that resowner.c will call your shared mem segment callback before closing any backend local vFDs, so I can't imagine how it could be that this risk doesn't exist. FWIW, I briefly entertained the idea that we could pin a vFD for just a moment, ensuring that the real FD could not be close()'d out by vfdcache LRU maintenance, which would fix this problem for parallel tuplesort, I suppose. That may not be workable for PHJ, because PHJ would probably need to hold on to such a "pin" for much longer, owing to the lack of any explicit "handover" phase. -- Peter Geoghegan
Re: [HACKERS] WIP: [[Parallel] Shared] Hash
Hi, Here is a new patch series responding to feedback from Peter and Andres: 1. Support pgstat_report_tempfile and log_temp_files, which I had overlooked as Peter pointed out. 2. Use a patch format that is acceptable to git am, per complaint off-list from Andres. (Not actually made with git format-patch; I need to learn some more git-fu, but they now apply cleanly with git am). On Thu, Mar 23, 2017 at 12:55 PM, Peter Geoghegan wrote: > I took a quick look at your V9 today. This is by no means a > comprehensive review of 0008-hj-shared-buf-file-v9.patch, but it's > what I can manage right now. Thanks. I really appreciate your patience with the resource management stuff I had failed to think through. > ... > > However, I notice that the place that this happens instead, > PathNameDelete(), does not repeat the fd.c step of doing a final > stat(), and using the stats for a pgstat_report_tempfile(). So, a new > pgstat_report_tempfile() call is simply missing. However, the more > fundamental issue in my mind is: How can you fix that? Where would it > go if you had it? You're right. I may be missing something here (again), but it does seem straightforward to implement because we always delete each file that really exists exactly once (and sometimes we also try to delete files that don't exist due to imprecise meta-data, but that isn't harmful and we know when that turns out to be the case). > If you do the obvious thing of just placing that before the new > unlink() within PathNameDelete(), on the theory that that needs parity > with the fd.c stuff, that has non-obvious implications. Does the > pgstat_report_tempfile() call need to happen when going through this > path, for example?: > >> +/* >> + * Destroy a shared BufFile early. Files are normally cleaned up >> + * automatically when all participants detach, but it might be useful to >> + * reclaim disk space sooner than that. 
The caller asserts that no backends >> + * will attempt to read from this file again and that only one backend will >> + * destroy it. >> + */ >> +void >> +SharedBufFileDestroy(SharedBufFileSet *set, int partition, int participant) >> +{ Yes, I think it should definitely go into PathNameDeleteTemporaryFile() (formerly PathNameDelete()). > The theory with the unlink()'ing function PathNameDelete(), I > gather, is that it doesn't matter if it happens to be called more than > once, say from a worker and then in an error handling path in the > leader or whatever. Do I have that right? Yes, it may be called for a file that doesn't exist either because it never existed, or because it has already been deleted. To recap, there are two reasons it needs to tolerate attempts to delete files that aren't there: 1. To be able to delete the fd.c files backing a BufFile given only a BufFileTag. We don't know how many segment files there are, but we know how to build the prefix of the filename so we try to delete [prefix].0, [prefix].1, [prefix].2 ... until we get ENOENT and terminate. I think this sort of thing would be more questionable for durable storage backing a database object, but for temporary files I can't think of a problem with it. 2. SharedBufFileSet doesn't actually know how many partitions exist, it just knows the *range* of partition numbers (because of its conflicting fixed space and increasable partitions requirements). From that information it can loop building BufFileTags for all backing files that *might* exist, and in practice they usually do because we don't tend to have a 'sparse' range of partitions. The error handling path isn't a special case: whoever is the last to detach from the DSM segment will delete all the files, whether that results from an error or not. Now someone might call SharedBufFileDestroy() to delete files sooner, but that can't happen at the same time as a detach cleanup (the caller is still attached). 
As a small optimisation avoiding a bunch of pointless unlink syscalls, I shrink the SharedBufFileSet range if you happen to delete explicitly with a partition number at the extremities of the range, and it so happens that Parallel Hash Join explicitly deletes them in partition order as the join runs, so in practice the range is empty by the time SharedBufFileSet's cleanup runs and there is nothing to do, unless an error occurs. > Obviously the concern I have about that is that any stat() call you > might add for the benefit of a new pgstat_report_tempfile() call, > needed to keep parity with fd.c, now has a risk of double counting in > error paths, if I'm not mistaken. We do need to do that accounting in > the event of error, just as we do when there is no error, at least if > current stats collector behavior is to be preserved. How can you > determine which duplicate call here is the duplicate? In other words, > how can you figure out which one is not supposed to > pgstat_report_tempfile()? If the size of temp files in each worker is > unknowable to the
Re: [HACKERS] WIP: [[Parallel] Shared] Hash
On Wed, Mar 22, 2017 at 3:17 AM, Thomas Munro wrote: >> If I follow the new code correctly, then it doesn't matter that you've >> unlink()'d to take care of the more obvious resource management chore. >> You can still have a reference leak like this, if I'm not mistaken, >> because you still have backend local state (local VfdCache) that is >> left totally decoupled with the new "shadow resource manager" for >> shared BufFiles. > > You're right. The attached version fixes these problems. The > BufFiles created or opened in this new way now participate in both of > our leak-detection and clean-up schemes: the one in resowner.c > (because I'm now explicitly registering with it as I had failed to do > before) and the one in CleanupTempFiles (because FD_CLOSE_AT_EOXACT is > set, which I already had in the previous version for the creator, but > not the opener of such a file). I tested by commenting out my > explicit BufFileClose calls to check that resowner.c starts > complaining, and then by commenting out the resowner registration too > to check that CleanupTempFiles starts complaining. I took a quick look at your V9 today. This is by no means a comprehensive review of 0008-hj-shared-buf-file-v9.patch, but it's what I can manage right now. 
The main change you made is well represented by the following part of the patch, where you decouple close at eoXact from delete at eoXact, with the intention of doing one but not the other for BufFiles that are shared: > /* these are the assigned bits in fdstate below: */ > -#define FD_TEMPORARY (1 << 0) /* T = delete when closed */ > -#define FD_XACT_TEMPORARY (1 << 1) /* T = delete at eoXact */ > +#define FD_DELETE_AT_CLOSE (1 << 0) /* T = delete when closed */ > +#define FD_CLOSE_AT_EOXACT (1 << 1) /* T = close at eoXact */ > +#define FD_TEMP_FILE_LIMIT (1 << 2) /* T = respect temp_file_limit */ So, shared BufFile fd.c segments within backend local resource manager do not have FD_DELETE_AT_CLOSE set, because you mean to do that part yourself by means of generic shared cleanup through dynamic shared memory segment callback. So far so good. However, I notice that the place that this happens instead, PathNameDelete(), does not repeat the fd.c step of doing a final stat(), and using the stats for a pgstat_report_tempfile(). So, a new pgstat_report_tempfile() call is simply missing. However, the more fundamental issue in my mind is: How can you fix that? Where would it go if you had it? If you do the obvious thing of just placing that before the new unlink() within PathNameDelete(), on the theory that that needs parity with the fd.c stuff, that has non-obvious implications. Does the pgstat_report_tempfile() call need to happen when going through this path, for example?: > +/* > + * Destroy a shared BufFile early. Files are normally cleaned up > + * automatically when all participants detach, but it might be useful to > + * reclaim disk space sooner than that. The caller asserts that no backends > + * will attempt to read from this file again and that only one backend will > + * destroy it. 
> + */ > +void > +SharedBufFileDestroy(SharedBufFileSet *set, int partition, int participant) > +{ The theory with the unlink()'ing function PathNameDelete(), I gather, is that it doesn't matter if it happens to be called more than once, say from a worker and then in an error handling path in the leader or whatever. Do I have that right? Obviously the concern I have about that is that any stat() call you might add for the benefit of a new pgstat_report_tempfile() call, needed to keep parity with fd.c, now has a risk of double counting in error paths, if I'm not mistaken. We do need to do that accounting in the event of error, just as we do when there is no error, at least if current stats collector behavior is to be preserved. How can you determine which duplicate call here is the duplicate? In other words, how can you figure out which one is not supposed to pgstat_report_tempfile()? If the size of temp files in each worker is unknowable to the implementation in error paths, does it not follow that it's unknowable to the user that queries pg_stat_database? Now, I don't imagine that this should stump you. Maybe I'm wrong about that possibility (that you cannot have exactly once unlink()/stat()/whatever), or maybe I'm right and you can fix it while preserving existing behavior, for example by relying on unlink() reliably failing when called a second time, no matter how tight any race was. What exact semantics does unlink() have with concurrency, as far as the link itself goes? If I'm not wrong about the general possibility, then maybe the existing behavior doesn't need to be preserved in error paths, which are after all exceptional -- it's not as if the statistics collector is currently highly reliable. It's not obvious that you are deliberately accepting of any of these risks or costs, though, which I think needs to
Re: [HACKERS] WIP: [[Parallel] Shared] Hash
Hi,

Here is a new version addressing feedback from Peter and Andres. Please see below.

On Wed, Mar 22, 2017 at 3:18 PM, Peter Geoghegan wrote:
> On Tue, Mar 21, 2017 at 5:07 AM, Thomas Munro
> wrote:
>>> buffile.c should stop pretending to care about anything other than
>>> temp files, IMV. 100% of all clients that want temporary files go
>>> through buffile.c. 100% of all clients that want non-temp files (files
>>> which are not marked FD_TEMPORARY) access fd.c directly, rather than
>>> going through buffile.c.
>>
>> I still need BufFile because I want buffering.
>>
>> There are 3 separate characteristics enabled by flags with 'temporary'
>> in their name. I think we should consider separating the concerns by
>> splitting and renaming them:
>>
>> 1. Segmented BufFile behaviour. I propose renaming BufFile's isTemp
>> member to isSegmented, because that is what it really does. I want
>> that feature independently without getting confused about lifetimes.
>> Tested with small MAX_PHYSICAL_FILESIZE as you suggested.
>
> I would have proposed to get rid of the isTemp field entirely. It is
> always true with current usage, and only #ifdef NOT_USED code presumes
> that it could be any other way. BufFile is all about temp files, which
> ISTM should be formalized. The whole point of BufFile is to segment
> fd.c temp file segments. Who would ever want to use BufFile without
> that capability anyway?

Yeah, it looks like you're probably right, but I guess others could have uses for BufFile that we don't know about. It doesn't seem like it hurts to leave the variable in existence.

>> 2. The temp_file_limit system. Currently this applies to fd.c files
>> opened with FD_TEMPORARY. You're right that we shouldn't be able to
>> escape that sanity check on disk space just because we want to manage
>> disk file ownership differently. I propose that we create a new flag
>> FD_TEMP_FILE_LIMIT that can be set independently of the flags
>> controlling disk file lifetime.
>> When working with SharedBufFileSet,
>> the limit applies to each backend in respect of files it created,
>> while it has them open. This seems a lot simpler than any
>> shared-temp-file-limit type scheme and is vaguely similar to the way
>> work_mem applies in each backend for parallel query.
>
> I agree that that makes sense as a user-visible behavior of
> temp_file_limit. This user-visible behavior is what I actually
> implemented for parallel CREATE INDEX.

Ok, good.

>> 3. Delete-on-close/delete-at-end-of-xact. I don't want to use that
>> facility so I propose disconnecting it from the above. We could
>> rename those fd.c-internal flags FD_TEMPORARY and FD_XACT_TEMPORARY to
>> FD_DELETE_AT_CLOSE and FD_DELETE_AT_EOXACT.
>
> This reliably unlink()s all files, albeit while relying on unlink()
> ENOENT as a condition that terminates deletion of one particular
> worker's BufFile's segments. However, because you effectively no
> longer use resowner.c, ISTM that there is still a resource leak in
> error paths. ResourceOwnerReleaseInternal() won't call FileClose() for
> temp-ish files (that are not quite temp files in the current sense) in
> the absence of any other place managing to do so, such as
> BufFileClose(). How can you be sure that you'll actually close() the
> FD itself (not vFD) within fd.c in the event of an error? Or Delete(),
> which does some LRU maintenance for backend's local VfdCache?

Yeah, I definitely need to use resowner.c. The only thing I want to opt out of is automatic file deletion in that code path.

> If I follow the new code correctly, then it doesn't matter that you've
> unlink()'d to take care of the more obvious resource management chore.
> You can still have a reference leak like this, if I'm not mistaken,
> because you still have backend local state (local VfdCache) that is
> left totally decoupled from the new "shadow resource manager" for
> shared BufFiles.

You're right. The attached version fixes these problems.
The BufFiles created or opened in this new way now participate in both of our leak-detection and clean-up schemes: the one in resowner.c (because I'm now explicitly registering with it as I had failed to do before) and the one in CleanupTempFiles (because FD_CLOSE_AT_EOXACT is set, which I already had in the previous version for the creator, but not the opener of such a file). I tested by commenting out my explicit BufFileClose calls to check that resowner.c starts complaining, and then by commenting out the resowner registration too to check that CleanupTempFiles starts complaining. >> As shown in 0008-hj-shared-buf-file-v8.patch. Thoughts? > > A less serious issue I've also noticed is that you add palloc() calls, > implicitly using the current memory context, within buffile.c. > BufFileOpenTagged() has some, for example. However, there is a note > that we don't need to save the memory context when we open a BufFile > because we always
Re: [HACKERS] WIP: [[Parallel] Shared] Hash
On Tue, Mar 21, 2017 at 7:18 PM, Peter Geoghegan wrote:
>> As shown in 0008-hj-shared-buf-file-v8.patch. Thoughts?
>
> A less serious issue I've also noticed is that you add palloc() calls,
> implicitly using the current memory context, within buffile.c.
> BufFileOpenTagged() has some, for example. However, there is a note
> that we don't need to save the memory context when we open a BufFile
> because we always repalloc(). That is no longer the case here.

Similarly, I think that your new type of BufFile has no need to save CurrentResourceOwner, because it won't ever actually be used. I suppose that you should at least note this in comments.

-- 
Peter Geoghegan

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] WIP: [[Parallel] Shared] Hash
On Tue, Mar 21, 2017 at 5:07 AM, Thomas Munro wrote:
>> buffile.c should stop pretending to care about anything other than
>> temp files, IMV. 100% of all clients that want temporary files go
>> through buffile.c. 100% of all clients that want non-temp files (files
>> which are not marked FD_TEMPORARY) access fd.c directly, rather than
>> going through buffile.c.
>
> I still need BufFile because I want buffering.
>
> There are 3 separate characteristics enabled by flags with 'temporary'
> in their name. I think we should consider separating the concerns by
> splitting and renaming them:
>
> 1. Segmented BufFile behaviour. I propose renaming BufFile's isTemp
> member to isSegmented, because that is what it really does. I want
> that feature independently without getting confused about lifetimes.
> Tested with small MAX_PHYSICAL_FILESIZE as you suggested.

I would have proposed to get rid of the isTemp field entirely. It is always true with current usage, and only #ifdef NOT_USED code presumes that it could be any other way. BufFile is all about temp files, which ISTM should be formalized. The whole point of BufFile is to segment fd.c temp file segments. Who would ever want to use BufFile without that capability anyway?

> 2. The temp_file_limit system. Currently this applies to fd.c files
> opened with FD_TEMPORARY. You're right that we shouldn't be able to
> escape that sanity check on disk space just because we want to manage
> disk file ownership differently. I propose that we create a new flag
> FD_TEMP_FILE_LIMIT that can be set independently of the flags
> controlling disk file lifetime. When working with SharedBufFileSet,
> the limit applies to each backend in respect of files it created,
> while it has them open. This seems a lot simpler than any
> shared-temp-file-limit type scheme and is vaguely similar to the way
> work_mem applies in each backend for parallel query.

I agree that that makes sense as a user-visible behavior of temp_file_limit.
This user-visible behavior is what I actually implemented for parallel CREATE INDEX.

> 3. Delete-on-close/delete-at-end-of-xact. I don't want to use that
> facility so I propose disconnecting it from the above. We could
> rename those fd.c-internal flags FD_TEMPORARY and FD_XACT_TEMPORARY to
> FD_DELETE_AT_CLOSE and FD_DELETE_AT_EOXACT.

This reliably unlink()s all files, albeit while relying on unlink() ENOENT as a condition that terminates deletion of one particular worker's BufFile's segments. However, because you effectively no longer use resowner.c, ISTM that there is still a resource leak in error paths. ResourceOwnerReleaseInternal() won't call FileClose() for temp-ish files (that are not quite temp files in the current sense) in the absence of any other place managing to do so, such as BufFileClose(). How can you be sure that you'll actually close() the FD itself (not vFD) within fd.c in the event of an error? Or Delete(), which does some LRU maintenance for backend's local VfdCache?

If I follow the new code correctly, then it doesn't matter that you've unlink()'d to take care of the more obvious resource management chore. You can still have a reference leak like this, if I'm not mistaken, because you still have backend local state (local VfdCache) that is left totally decoupled from the new "shadow resource manager" for shared BufFiles.

> As shown in 0008-hj-shared-buf-file-v8.patch. Thoughts?

A less serious issue I've also noticed is that you add palloc() calls, implicitly using the current memory context, within buffile.c. BufFileOpenTagged() has some, for example. However, there is a note that we don't need to save the memory context when we open a BufFile because we always repalloc(). That is no longer the case here.

-- 
Peter Geoghegan
Re: [HACKERS] WIP: [[Parallel] Shared] Hash
Hi, Here is a new version of the patch series addressing complaints from Rafia, Peter, Andres and Robert -- see below. First, two changes not already covered in this thread: 1. Today Robert asked me a question off-list that I hadn't previously considered: since I am sharing tuples between backends, don't I have the same kind of transient record remapping problems that tqueue.c has to deal with? The answer must be yes, and in fact it's a trickier version because there are N 'senders' and N 'receivers' communicating via the shared hash table. So I decided to avoid the problem by not planning shared hash tables if the tuples could include transient RECORD types: see tlist_references_transient_type() in 0007-hj-shared-single-batch-v8.patch. Perhaps in the future we can find a way for parallel query to keep local types in sync, so this restriction could be lifted. (I've tested this with a specially modified build, because I couldn't figure out how to actually get any transient types to be considered in a parallel query, but if someone has a suggestion for a good query for that I'd love to put one into the regression test.) 2. Earlier versions included support for Shared Hash (= one worker builds, other workers wait, but we get to use work_mem * P memory) and Parallel Shared Hash (= all workers build). Shared Hash is by now quite hard to reach, since so many hash join inner plans are now parallelisable. I decided to remove support for it from the latest patch series: I think it adds cognitive load and patch lines for little or no gain. With time running out, I thought that it would be better to rip it out for now to try to simplify things and avoid some difficult questions about how to cost that mode. It could be added with a separate patch after some more study if it really does make some sense. 
>> On Mon, Mar 13, 2017 at 8:40 PM, Rafia Sabih
>> wrote:
>>> In an attempt to test v7 of this patch on TPC-H 20 scale factor I found a
>>> few regressions,
>>> Q21: 52 secs on HEAD and 400 secs with this patch

As already mentioned there is a planner bug which we can fix separately from this patch series. Until that is resolved, please see that other thread[1] for the extra tweak required for sane results when testing Q21.

Even with that tweak, there was a slight regression with fewer than 3 workers at 1GB for Q21. That turned out to be because the patched version was not always using as many workers as unpatched. To fix that, I had to rethink the deadlock avoidance system to make it a bit less conservative about giving up workers: see src/backend/utils/misc/leader_gate.c in 0007-hj-shared-single-batch-v8.patch.

Here are some speed-up numbers comparing master to patched that I recorded on TPCH scale 10 with work_mem = 1GB. These are the queries whose plans change with the patch. Both master and v8 were patched with fix-neqseljoin-for-semi-joins.patch.
 query | w = 0 | w = 1 | w = 2 | w = 3 | w = 4 | w = 5 | w = 6 | w = 7 | w = 8
-------+-------+-------+-------+-------+-------+-------+-------+-------+-------
 Q3    | 0.94x | 1.06x | 1.25x | 1.46x | 1.64x | 1.87x | 1.99x | 1.67x | 1.67x
 Q5    | 1.17x | 1.03x | 1.23x | 1.27x | 1.44x | 0.56x | 0.95x | 0.94x | 1.16x
 Q7    | 1.13x | 1.04x | 1.31x | 1.06x | 1.15x | 1.28x | 1.31x | 1.35x | 1.13x
 Q8    | 0.99x | 1.13x | 1.23x | 1.22x | 1.36x | 0.42x | 0.82x | 0.78x | 0.81x
 Q9    | 1.16x | 0.95x | 1.92x | 1.68x | 1.90x | 1.89x | 2.02x | 2.05x | 1.81x
 Q10   | 1.01x | 1.03x | 1.08x | 1.10x | 1.16x | 1.17x | 1.09x | 1.01x | 1.07x
 Q12   | 1.03x | 1.19x | 1.42x | 0.75x | 0.74x | 1.00x | 0.99x | 1.00x | 1.01x
 Q13   | 1.10x | 1.66x | 1.99x | 1.00x | 1.12x | 1.00x | 1.12x | 1.01x | 1.13x
 Q14   | 0.97x | 1.13x | 1.22x | 1.45x | 1.43x | 1.55x | 1.55x | 1.50x | 1.45x
 Q16   | 1.02x | 1.13x | 1.07x | 1.09x | 1.10x | 1.10x | 1.13x | 1.10x | 1.11x
 Q18   | 1.05x | 1.43x | 1.33x | 1.21x | 1.07x | 1.57x | 1.76x | 1.09x | 1.09x
 Q21   | 0.99x | 1.01x | 1.07x | 1.18x | 1.28x | 1.37x | 1.63x | 1.26x | 1.60x

These tests are a bit short and noisy and clearly there are some strange dips in there that need some investigation, but the trend is positive.

Here are some numbers from some simple test joins, so that you can see the raw speedup of large hash joins without all the other things going on in those TPCH plans. I executed 1-join, 2-join and 3-join queries like this:

  CREATE TABLE simple AS SELECT generate_series(1, 1000) AS id, 'aa';
  ANALYZE simple;

  SELECT COUNT(*) FROM simple r JOIN simple s USING (id);

  SELECT COUNT(*) FROM simple r JOIN simple s USING (id)
  JOIN simple t USING (id);

  SELECT COUNT(*) FROM simple r JOIN simple s USING (id)
  JOIN simple t USING (id) JOIN simple u USING (id);

Unpatched master can make probing go faster by adding workers, but not building, so in these self-joins the ability to scale with more CPUs is limited (here w = 1
Re: [HACKERS] WIP: [[Parallel] Shared] Hash
On Tue, Mar 14, 2017 at 8:03 AM, Thomas Munro wrote:
> On Mon, Mar 13, 2017 at 8:40 PM, Rafia Sabih
> wrote:
>> In an attempt to test v7 of this patch on TPC-H 20 scale factor I found a
>> few regressions,
>> Q21: 52 secs on HEAD and 400 secs with this patch
>
> Thanks Rafia. Robert just pointed out off-list that there is a bogus
> 0 row estimate in here:
>
> -> Parallel Hash Semi Join (cost=1006599.34..1719227.30 rows=0
> width=24) (actual time=38716.488..100933.250 rows=7315896 loops=5)
>
> Will investigate, thanks.

There are two problems here.

1. There is a pre-existing cardinality estimate problem for semi-joins with <> filters. The big Q21 regression reported by Rafia is caused by that phenomenon, probably exacerbated by another bug that allowed 0 cardinality estimates to percolate inside the planner. Estimates have been clamped at or above 1.0 since her report by commit 1ea60ad6. I started a new thread to discuss that because it's unrelated to this patch, except insofar as it confuses the planner about Q21 (with or without parallelism). Using one possible selectivity tweak suggested by Tom Lane, I was able to measure significant speedups on otherwise unpatched master: https://www.postgresql.org/message-id/CAEepm%3D11BiYUkgXZNzMtYhXh4S3a9DwUP8O%2BF2_ZPeGzzJFPbw%40mail.gmail.com

2. If you compare master tweaked as above against the latest version of my patch series with the tweak, then the patched version always runs faster with 4 or more workers, but with only 1 or 2 workers Q21 is a bit slower... but not always. I realised that there was a bi-modal distribution of execution times. It looks like my 'early exit' protocol, designed to make tuple-queue deadlock impossible, is often causing us to lose a worker. I am working on that now.

I have code changes for Peter G's and Andres's feedback queued up and will send a v8 series shortly, hopefully with a fix for problem 2 above.
-- 
Thomas Munro
http://www.enterprisedb.com
Re: [HACKERS] WIP: [[Parallel] Shared] Hash
On Mon, Mar 13, 2017 at 8:40 PM, Rafia Sabih wrote:
> In an attempt to test v7 of this patch on TPC-H 20 scale factor I found a
> few regressions,
> Q21: 52 secs on HEAD and 400 secs with this patch

Thanks Rafia. Robert just pointed out off-list that there is a bogus 0 row estimate in here:

-> Parallel Hash Semi Join (cost=1006599.34..1719227.30 rows=0 width=24) (actual time=38716.488..100933.250 rows=7315896 loops=5)

Will investigate, thanks.

> Q8: 8 secs on HEAD to 14 secs with patch

Also looking into this one.

-- 
Thomas Munro
http://www.enterprisedb.com
Re: [HACKERS] WIP: [[Parallel] Shared] Hash
On Thu, Mar 9, 2017 at 5:28 PM, Thomas Munrowrote: > On Wed, Mar 8, 2017 at 12:58 PM, Andres Freund wrote: > > 0002: Check hash join work_mem usage at the point of chunk allocation. > > > > Modify the existing hash join code to detect work_mem exhaustion at > > the point where chunks are allocated, instead of checking after every > > tuple insertion. This matches the logic used for estimating, and more > > importantly allows for some parallelism in later patches. > > > > diff --git a/src/backend/executor/nodeHash.c b/src/backend/executor/ > nodeHash.c > > index 406c180..af1b66d 100644 > > --- a/src/backend/executor/nodeHash.c > > +++ b/src/backend/executor/nodeHash.c > > @@ -48,7 +48,8 @@ static void ExecHashSkewTableInsert(HashJoinTable > hashtable, > > int bucketNumber); > > static void ExecHashRemoveNextSkewBucket(HashJoinTable hashtable); > > > > -static void *dense_alloc(HashJoinTable hashtable, Size size); > > +static void *dense_alloc(HashJoinTable hashtable, Size size, > > +bool respect_work_mem); > > > > I still dislike this, but maybe Robert's point of: > > > > On 2017-02-16 08:57:21 -0500, Robert Haas wrote: > >> On Wed, Feb 15, 2017 at 9:36 PM, Andres Freund > wrote: > >> > Isn't it kinda weird to do this from within dense_alloc()? I mean > that > >> > dumps a lot of data to disk, frees a bunch of memory and so on - not > >> > exactly what "dense_alloc" implies. Isn't the free()ing part also > >> > dangerous, because the caller might actually use some of that memory, > >> > like e.g. in ExecHashRemoveNextSkewBucket() or such. I haven't looked > >> > deeply enough to check whether that's an active bug, but it seems like > >> > inviting one if not. > >> > >> I haven't looked at this, but one idea might be to just rename > >> dense_alloc() to ExecHashBlahBlahSomething(). If there's a real > >> abstraction layer problem here then we should definitely fix it, but > >> maybe it's just the angle at which you hold your head. > > > > Is enough. 
> > There is a problem here. It can determine that it needs to increase > the number of batches, effectively splitting the current batch, but > then the caller goes on to insert the current tuple anyway, even > though it may no longer belong in this batch. I will post a fix for > that soon. I will also refactor it so that it doesn't do that work > inside dense_alloc. You're right, that's too weird. > > In the meantime, here is a new patch series addressing the other > things you raised. > > > 0003: Scan for unmatched tuples in a hash join one chunk at a time. > > > > > > @@ -1152,8 +1155,65 @@ bool > > ExecScanHashTableForUnmatched(HashJoinState *hjstate, ExprContext > *econtext) > > { > > HashJoinTable hashtable = hjstate->hj_HashTable; > > - HashJoinTuple hashTuple = hjstate->hj_CurTuple; > > + HashJoinTuple hashTuple; > > + MinimalTuple tuple; > > + > > + /* > > +* First, process the queue of chunks holding tuples that are in > regular > > +* (non-skew) buckets. > > +*/ > > + for (;;) > > + { > > + /* Do we need a new chunk to scan? */ > > + if (hashtable->current_chunk == NULL) > > + { > > + /* Have we run out of chunks to scan? */ > > + if (hashtable->unmatched_chunks == NULL) > > + break; > > + > > + /* Pop the next chunk from the front of the > queue. */ > > + hashtable->current_chunk = > hashtable->unmatched_chunks; > > + hashtable->unmatched_chunks = > hashtable->current_chunk->next; > > + hashtable->current_chunk_index = 0; > > + } > > + > > + /* Have we reached the end of this chunk yet? */ > > + if (hashtable->current_chunk_index >= > hashtable->current_chunk->used) > > + { > > + /* Go around again to get the next chunk from > the queue. */ > > + hashtable->current_chunk = NULL; > > + continue; > > + } > > + > > + /* Take the next tuple from this chunk. 
*/ > > + hashTuple = (HashJoinTuple) > > + (hashtable->current_chunk->data + > hashtable->current_chunk_index); > > + tuple = HJTUPLE_MINTUPLE(hashTuple); > > + hashtable->current_chunk_index += > > + MAXALIGN(HJTUPLE_OVERHEAD + tuple->t_len); > > + > > + /* Is it unmatched? */ > > + if (!HeapTupleHeaderHasMatch(tuple)) > > + { > > + TupleTableSlot *inntuple; > > + > > + /* insert hashtable's tuple into exec slot */ > > +
Re: [HACKERS] WIP: [[Parallel] Shared] Hash
On Thu, Mar 9, 2017 at 3:58 AM, Thomas Munro wrote:
> In the meantime, here is a new patch series addressing the other
> things you raised.

Please see my remarks on 0007-hj-shared-buf-file-v7.patch over on the "on_dsm_detach() callback and parallel tuplesort BufFile resource management" thread. They still apply to this latest version of the patch series.

-- 
Peter Geoghegan
Re: [HACKERS] WIP: [[Parallel] Shared] Hash
On Wed, Mar 8, 2017 at 12:58 PM, Andres Freundwrote: > 0002: Check hash join work_mem usage at the point of chunk allocation. > > Modify the existing hash join code to detect work_mem exhaustion at > the point where chunks are allocated, instead of checking after every > tuple insertion. This matches the logic used for estimating, and more > importantly allows for some parallelism in later patches. > > diff --git a/src/backend/executor/nodeHash.c b/src/backend/executor/nodeHash.c > index 406c180..af1b66d 100644 > --- a/src/backend/executor/nodeHash.c > +++ b/src/backend/executor/nodeHash.c > @@ -48,7 +48,8 @@ static void ExecHashSkewTableInsert(HashJoinTable hashtable, > int bucketNumber); > static void ExecHashRemoveNextSkewBucket(HashJoinTable hashtable); > > -static void *dense_alloc(HashJoinTable hashtable, Size size); > +static void *dense_alloc(HashJoinTable hashtable, Size size, > +bool respect_work_mem); > > I still dislike this, but maybe Robert's point of: > > On 2017-02-16 08:57:21 -0500, Robert Haas wrote: >> On Wed, Feb 15, 2017 at 9:36 PM, Andres Freund wrote: >> > Isn't it kinda weird to do this from within dense_alloc()? I mean that >> > dumps a lot of data to disk, frees a bunch of memory and so on - not >> > exactly what "dense_alloc" implies. Isn't the free()ing part also >> > dangerous, because the caller might actually use some of that memory, >> > like e.g. in ExecHashRemoveNextSkewBucket() or such. I haven't looked >> > deeply enough to check whether that's an active bug, but it seems like >> > inviting one if not. >> >> I haven't looked at this, but one idea might be to just rename >> dense_alloc() to ExecHashBlahBlahSomething(). If there's a real >> abstraction layer problem here then we should definitely fix it, but >> maybe it's just the angle at which you hold your head. > > Is enough. There is a problem here. 
It can determine that it needs to increase the number of batches, effectively splitting the current batch, but then the caller goes on to insert the current tuple anyway, even though it may no longer belong in this batch. I will post a fix for that soon. I will also refactor it so that it doesn't do that work inside dense_alloc. You're right, that's too weird. In the meantime, here is a new patch series addressing the other things you raised. > 0003: Scan for unmatched tuples in a hash join one chunk at a time. > > > @@ -1152,8 +1155,65 @@ bool > ExecScanHashTableForUnmatched(HashJoinState *hjstate, ExprContext *econtext) > { > HashJoinTable hashtable = hjstate->hj_HashTable; > - HashJoinTuple hashTuple = hjstate->hj_CurTuple; > + HashJoinTuple hashTuple; > + MinimalTuple tuple; > + > + /* > +* First, process the queue of chunks holding tuples that are in > regular > +* (non-skew) buckets. > +*/ > + for (;;) > + { > + /* Do we need a new chunk to scan? */ > + if (hashtable->current_chunk == NULL) > + { > + /* Have we run out of chunks to scan? */ > + if (hashtable->unmatched_chunks == NULL) > + break; > + > + /* Pop the next chunk from the front of the queue. */ > + hashtable->current_chunk = > hashtable->unmatched_chunks; > + hashtable->unmatched_chunks = > hashtable->current_chunk->next; > + hashtable->current_chunk_index = 0; > + } > + > + /* Have we reached the end of this chunk yet? */ > + if (hashtable->current_chunk_index >= > hashtable->current_chunk->used) > + { > + /* Go around again to get the next chunk from the > queue. */ > + hashtable->current_chunk = NULL; > + continue; > + } > + > + /* Take the next tuple from this chunk. */ > + hashTuple = (HashJoinTuple) > + (hashtable->current_chunk->data + > hashtable->current_chunk_index); > + tuple = HJTUPLE_MINTUPLE(hashTuple); > + hashtable->current_chunk_index += > + MAXALIGN(HJTUPLE_OVERHEAD + tuple->t_len); > + > + /* Is it unmatched? 
*/ > + if (!HeapTupleHeaderHasMatch(tuple)) > + { > + TupleTableSlot *inntuple; > + > + /* insert hashtable's tuple into exec slot */ > + inntuple = ExecStoreMinimalTuple(tuple, > + >hjstate->hj_HashTupleSlot, > + >false); /* do not pfree */ > +
Re: [HACKERS] WIP: [[Parallel] Shared] Hash
On Wed, Mar 8, 2017 at 1:15 PM, Tom Lane wrote:
> Andres Freund writes:
>> +++ b/src/include/storage/barrier.h
>> +#include "postgres.h"
>
>> Huh, that normally shouldn't be in a header. I see you introduced that
>> in a bunch of other places too - that really doesn't look right to me.
>
> That is absolutely not project style and is not acceptable.
>
> The core reason why not is that postgres.h/postgres_fe.h/c.h have to be
> the *first* inclusion in every compilation, for arcane portability reasons
> you really don't want to know about. (Suffice it to say that on some
> platforms, stdio.h isn't all that std.) Our coding rule for that is that
> we put the appropriate one of these first in every .c file, while .h files
> always assume that it's been included already. As soon as you break that
> convention, it becomes unclear from looking at a .c file whether the
> ordering requirement has been satisfied. Also, since now you've moved
> the must-be-first requirement to some other header file(s), you risk
> breakage when somebody applies another project convention about
> alphabetizing #include references for all headers other than those magic
> ones.

Thanks for the explanation. Will post a new series addressing this and other complaints from Andres shortly.

-- 
Thomas Munro
http://www.enterprisedb.com
Re: [HACKERS] WIP: [[Parallel] Shared] Hash
Andres Freund writes:
> +++ b/src/include/storage/barrier.h
> +#include "postgres.h"

> Huh, that normally shouldn't be in a header. I see you introduced that
> in a bunch of other places too - that really doesn't look right to me.

That is absolutely not project style and is not acceptable.

The core reason why not is that postgres.h/postgres_fe.h/c.h have to be the *first* inclusion in every compilation, for arcane portability reasons you really don't want to know about. (Suffice it to say that on some platforms, stdio.h isn't all that std.) Our coding rule for that is that we put the appropriate one of these first in every .c file, while .h files always assume that it's been included already. As soon as you break that convention, it becomes unclear from looking at a .c file whether the ordering requirement has been satisfied. Also, since now you've moved the must-be-first requirement to some other header file(s), you risk breakage when somebody applies another project convention about alphabetizing #include references for all headers other than those magic ones.

In short, don't even think of doing this.

regards, tom lane
Re: [HACKERS] WIP: [[Parallel] Shared] Hash
Hi, 0001: Do hash join work_mem accounting in chunks. Don't think there's much left to say. 0002: Check hash join work_mem usage at the point of chunk allocation. Modify the existing hash join code to detect work_mem exhaustion at the point where chunks are allocated, instead of checking after every tuple insertion. This matches the logic used for estimating, and more importantly allows for some parallelism in later patches. diff --git a/src/backend/executor/nodeHash.c b/src/backend/executor/nodeHash.c index 406c180..af1b66d 100644 --- a/src/backend/executor/nodeHash.c +++ b/src/backend/executor/nodeHash.c @@ -48,7 +48,8 @@ static void ExecHashSkewTableInsert(HashJoinTable hashtable, int bucketNumber); static void ExecHashRemoveNextSkewBucket(HashJoinTable hashtable); -static void *dense_alloc(HashJoinTable hashtable, Size size); +static void *dense_alloc(HashJoinTable hashtable, Size size, +bool respect_work_mem); I still dislike this, but maybe Robert's point of: On 2017-02-16 08:57:21 -0500, Robert Haas wrote: > On Wed, Feb 15, 2017 at 9:36 PM, Andres Freundwrote: > > Isn't it kinda weird to do this from within dense_alloc()? I mean that > > dumps a lot of data to disk, frees a bunch of memory and so on - not > > exactly what "dense_alloc" implies. Isn't the free()ing part also > > dangerous, because the caller might actually use some of that memory, > > like e.g. in ExecHashRemoveNextSkewBucket() or such. I haven't looked > > deeply enough to check whether that's an active bug, but it seems like > > inviting one if not. > > I haven't looked at this, but one idea might be to just rename > dense_alloc() to ExecHashBlahBlahSomething(). If there's a real > abstraction layer problem here then we should definitely fix it, but > maybe it's just the angle at which you hold your head. Is enough. 0003: Scan for unmatched tuples in a hash join one chunk at a time. 
@@ -1152,8 +1155,65 @@ bool ExecScanHashTableForUnmatched(HashJoinState *hjstate, ExprContext *econtext) { HashJoinTable hashtable = hjstate->hj_HashTable; - HashJoinTuple hashTuple = hjstate->hj_CurTuple; + HashJoinTuple hashTuple; + MinimalTuple tuple; + + /* +* First, process the queue of chunks holding tuples that are in regular +* (non-skew) buckets. +*/ + for (;;) + { + /* Do we need a new chunk to scan? */ + if (hashtable->current_chunk == NULL) + { + /* Have we run out of chunks to scan? */ + if (hashtable->unmatched_chunks == NULL) + break; + + /* Pop the next chunk from the front of the queue. */ + hashtable->current_chunk = hashtable->unmatched_chunks; + hashtable->unmatched_chunks = hashtable->current_chunk->next; + hashtable->current_chunk_index = 0; + } + + /* Have we reached the end of this chunk yet? */ + if (hashtable->current_chunk_index >= hashtable->current_chunk->used) + { + /* Go around again to get the next chunk from the queue. */ + hashtable->current_chunk = NULL; + continue; + } + + /* Take the next tuple from this chunk. */ + hashTuple = (HashJoinTuple) + (hashtable->current_chunk->data + hashtable->current_chunk_index); + tuple = HJTUPLE_MINTUPLE(hashTuple); + hashtable->current_chunk_index += + MAXALIGN(HJTUPLE_OVERHEAD + tuple->t_len); + + /* Is it unmatched? */ + if (!HeapTupleHeaderHasMatch(tuple)) + { + TupleTableSlot *inntuple; + + /* insert hashtable's tuple into exec slot */ + inntuple = ExecStoreMinimalTuple(tuple, + hjstate->hj_HashTupleSlot, + false); /* do not pfree */ + econtext->ecxt_innertuple = inntuple; + + /* reset context each time (see below for explanation) */ + ResetExprContext(econtext); + return true; + } + } I suspect this might actually be slower than the current/old logic, because the current_chunk tests are repeated every loop. I think retaining the two loops the previous code had makes sense, i.e. one to find a relevant chunk, and one to iterate through all tuples in a chunk, checking for an unmatched one. 
Have you run a performance comparison pre/post this patch? I
Re: [HACKERS] WIP: [[Parallel] Shared] Hash
Hi, On 2017-03-07 02:57:30 +1300, Thomas Munro wrote: > I'm not sure why nodeHashjoin.c is doing raw batchfile read/write > operations anyway; why not use tuplestore.c for that (as > tuplestore.c's comments incorrectly say is the case)? Another reason presumably is that using tuplestores would make it harder to control the amount of memory used - we do *not* want an extra set of work_mem used here, right? - Andres -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] WIP: [[Parallel] Shared] Hash
On Wed, Mar 1, 2017 at 10:40 PM, Thomas Munrowrote: > I'm testing a new version which incorporates feedback from Andres and > Ashutosh, and is refactored to use a new SharedBufFileSet component to > handle batch files, replacing the straw-man implementation from the v5 > patch series. I've set this to waiting-on-author and will post v6 > tomorrow. I created a system for reference counted partitioned temporary files called SharedBufFileSet: see 0007-hj-shared-buf-file.patch. Then I ripped out the code for sharing batch files that I previously had cluttering up nodeHashjoin.c, and refactored it into a new component called a SharedTuplestore which wraps a SharedBufFileSet and gives it a tuple-based interface: see 0008-hj-shared-tuplestore.patch. The name implies aspirations of becoming a more generally useful shared analogue of tuplestore, but for now it supports only the exact access pattern needed for hash join batches ($10 wrench). It creates temporary files like this: base/pgsql_tmp/pgsql_tmp[pid].[set].[partition].[participant].[segment] I'm not sure why nodeHashjoin.c is doing raw batchfile read/write operations anyway; why not use tuplestore.c for that (as tuplestore.c's comments incorrectly say is the case)? Maybe because Tuplestore's interface doesn't support storing the extra hash value. In SharedTuplestore I solved that problem by introducing an optional fixed sized piece of per-tuple meta-data. Another thing that is different about SharedTuplestore is that it supports partitions, which is convenient for this project and probably other parallel projects too. In order for workers to be able to participate in reference counting schemes based on DSM segment lifetime, I had to give the Exec*InitializeWorker() functions access to the dsm_segment object, whereas previously they received only the shm_toc in order to access its contents. I invented ParallelWorkerContext which has just two members 'seg' and 'toc': see 0005-hj-let-node-have-seg-in-worker.patch. 
I didn't touch the FDW API or custom scan API where they currently take toc, though I can see that there is an argument that they should; changing those APIs seems like a bigger deal. Another approach would be to use ParallelContext, as passed into ExecXXXInitializeDSM, with the members that are not applicable to workers zeroed out. Thoughts? I got rid of the ExecDetachXXX stuff I had invented in the last version, because acf555bc fixed the problem a better way. I found that I needed to put use more than one toc entry for a single executor node, in order to reserve space for the inner and outer SharedTuplestore objects. So I invented a way to make more extra keys with PARALLEL_KEY_EXECUTOR_NTH(plan_node_id, N). -- Thomas Munro http://www.enterprisedb.com parallel-shared-hash-v6.tgz Description: GNU Zip compressed data -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] WIP: [[Parallel] Shared] Hash
On Thu, Feb 16, 2017 at 9:08 PM, Thomas Munro wrote: > On Thu, Feb 16, 2017 at 3:36 PM, Andres Freund wrote: >> That's it for now... > > Thanks! Plenty for me to go away and think about. I will post a new > version soon. I'm testing a new version which incorporates feedback from Andres and Ashutosh, and is refactored to use a new SharedBufFileSet component to handle batch files, replacing the straw-man implementation from the v5 patch series. I've set this to waiting-on-author and will post v6 tomorrow. -- Thomas Munro http://www.enterprisedb.com -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
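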
Re: [HACKERS] WIP: [[Parallel] Shared] Hash
On Wed, Feb 15, 2017 at 9:36 PM, Andres Freundwrote: > 0002-hj-add-dtrace-probes-v5.patch > > Hm. I'm personally very unenthusiastic about addming more of these, and > would rather rip all of them out. I tend to believe that static > problems simply aren't a good approach for anything requiring a lot of > detail. But whatever. I'm not a big fan of either static problems or static probes, myself. > Isn't it kinda weird to do this from within dense_alloc()? I mean that > dumps a lot of data to disk, frees a bunch of memory and so on - not > exactly what "dense_alloc" implies. Isn't the free()ing part also > dangerous, because the caller might actually use some of that memory, > like e.g. in ExecHashRemoveNextSkewBucket() or such. I haven't looked > deeply enough to check whether that's an active bug, but it seems like > inviting one if not. I haven't looked at this, but one idea might be to just rename dense_alloc() to ExecHashBlahBlahSomething(). If there's a real abstraction layer problem here then we should definitely fix it, but maybe it's just the angle at which you hold your head. > To me that is a weakness in the ExecShutdownNode() API - imo child nodes > should get the chance to shutdown before the upper-level node. > ExecInitNode/ExecEndNode etc give individual nodes the freedom to do > things in the right order, but ExecShutdownNode() doesn't. I don't > quite see why we'd want to invent a separate ExecDetachNode() that'd be > called immediately before ExecShutdownNode(). Interestingly, the same point came up on the Parallel Bitmap Heap Scan thread. > An easy way to change that would be to return in the > ExecShutdownNode()'s T_GatherState case, and delegate the responsibility > of calling it on Gather's children to ExecShutdownGather(). > Alternatively we could make it a full-blown thing like ExecInitNode() > that every node needs to implement, but that seems a bit painful. 
I was thinking we should just switch things so that ExecShutdownNode() recurses first, and then does the current node. There's no real excuse for a node terminating the shutdown scan early, I think. > Or have I missed something here? > > Random aside: Wondered before if having to provide all executor > callbacks is a weakness of our executor integration, and whether it > shouldn't be a struct of callbacks instead... I honestly have no idea whether that would be better or worse from the CPU's point of view. > I think it's a bit weird that the parallel path now has an exit path > that the non-parallel path doesn't have. I'm not sure about this particular one, but in general those are pretty common. For example, look at the changes 569174f1be92be93f5366212cc46960d28a5c5cd made to _bt_first(). When you get there, you can discover that you aren't actually the first, and that in fact all the work is already complete, and there's nothing left for you to do but give up. > Hm. That seems a bit on the detailed side - if we're going that way it > seems likely that we'll end up with hundreds of wait events. I don't > think gradually evolving wait events into something like a query > progress framework is a good idea. I'm pretty strongly of the opinion that we should not reuse multiple wait events for the same purpose. The whole point of the wait event system is to identify what caused the wait. Having relatively recently done a ton of work to separate all of the waits in the system and identify them individually, I'm loathe to see us start melding things back together again. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] WIP: [[Parallel] Shared] Hash
On Thu, Feb 16, 2017 at 3:36 PM, Andres Freundwrote: > Hi, Thanks for the review! > FWIW, I'd appreciate if you'd added a short commit message to the > individual patches - I find it helpful to have a littlebit more context > while looking at them than just the titles. Alternatively you can > include that text when re-posting the series, but it's imo just as easy > to have a short commit message (and just use format-patch). > > I'm for now using [1] as context. Ok, will do. > 0002-hj-add-dtrace-probes-v5.patch > > Hm. I'm personally very unenthusiastic about addming more of these, and > would rather rip all of them out. I tend to believe that static > problems simply aren't a good approach for anything requiring a lot of > detail. But whatever. Ok, I will get rid of these. Apparently you aren't the only committer who hates these. (I have some other thoughts on that but will save them for another time.) > 0003-hj-refactor-memory-accounting-v5.patch > @@ -424,15 +422,29 @@ ExecChooseHashTableSize(double ntuples, int tupwidth, > bool useskew, > if (ntuples <= 0.0) > ntuples = 1000.0; > > - /* > -* Estimate tupsize based on footprint of tuple in hashtable... note > this > -* does not allow for any palloc overhead. The manipulations of > spaceUsed > -* don't count palloc overhead either. > -*/ > + /* Estimate tupsize based on footprint of tuple in hashtable. */ > > palloc overhead is still unaccounted for, no? In the chunked case that > might not be much, I realize that (so that comment should probably have > been updated when chunking was introduced). Yeah, it seemed like an obsolete comment, but I'll put it back as that isn't relevant to this patch. > - SizespaceUsed; /* memory space currently > used by tuples */ > + SizespaceUsed; /* memory space currently > used by hashtable */ > > It's not really the full hashtable, is it? The ->buckets array appears > to still be unaccounted for. It is actually the full hash table, and that is a change in this patch. 
See ExecHashTableCreate and ExecHashTableReset where it is set to nbuckets * sizeof(HashJoinTuple), so that at all times it holds the total size of buckets + all chunks. Unlike the code in master, where it's just the sum of all tuples while building, but then the bucket space is added at the end in MultiExecHash. > Looks ok. Thanks! > 0004-hj-refactor-batch-increases-v5.patch > > @@ -1693,10 +1689,12 @@ ExecHashRemoveNextSkewBucket(HashJoinTable hashtable) > } > > /* > - * Allocate 'size' bytes from the currently active HashMemoryChunk > + * Allocate 'size' bytes from the currently active HashMemoryChunk. If > + * 'respect_work_mem' is true, this may cause the number of batches to be > + * increased in an attempt to shrink the hash table. > */ > static void * > -dense_alloc(HashJoinTable hashtable, Size size) > +dense_alloc(HashJoinTable hashtable, Size size, bool respect_work_mem) > > { > HashMemoryChunk newChunk; > char *ptr; > @@ -1710,6 +1708,15 @@ dense_alloc(HashJoinTable hashtable, Size size) > */ > if (size > HASH_CHUNK_THRESHOLD) > { > + if (respect_work_mem && > + hashtable->growEnabled && > + hashtable->spaceUsed + HASH_CHUNK_HEADER_SIZE + size > > + hashtable->spaceAllowed) > + { > + /* work_mem would be exceeded: try to shrink hash > table */ > + ExecHashIncreaseNumBatches(hashtable); > + } > + > > Isn't it kinda weird to do this from within dense_alloc()? I mean that > dumps a lot of data to disk, frees a bunch of memory and so on - not > exactly what "dense_alloc" implies. Hmm. Yeah I guess that is a bit weird. My problem was that in the shared case (later patch), when you call this function it has a fast path and a slow path: the fast path just hands out more space from the existing chunk, but the slow path acquires an LWLock and makes space management decisions which have to be done sort of "atomically".
In an earlier version I toyed with the idea of making dense_alloc return NULL if you said respect_work_mem = true and it determined that you need to increase the number of batches or go help other workers who have already started doing so. Then the batch increase stuff was not in here, but callers who say respect_work_mem = true (the build and reload loops) had to be prepared to loop and shrink if they get NULL, or some wrapper function needs to do that them. I will try it that way again. > Isn't the free()ing part also > dangerous, because the caller might actually use some of that memory, > like e.g. in ExecHashRemoveNextSkewBucket() or such. I haven't looked > deeply enough to check whether that's an active bug, but it seems like > inviting one if not.
Re: [HACKERS] WIP: [[Parallel] Shared] Hash
Hi, Just to summarize what you could read between the lines in the previous mail: From a higher level POV the design here makes sense to me, I do however think there's a good chunk of code-level improvements needed. Regards, Andres -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] WIP: [[Parallel] Shared] Hash
Hi, On 2017-02-13 23:57:00 +1300, Thomas Munro wrote: > Here's a new version to fix the problems reported by Rafia above. The > patch descriptions are as before but it starts from 0002 because 0001 > was committed as 7c5d8c16 (thanks, Andres). FWIW, I'd appreciate if you'd added a short commit message to the individual patches - I find it helpful to have a littlebit more context while looking at them than just the titles. Alternatively you can include that text when re-posting the series, but it's imo just as easy to have a short commit message (and just use format-patch). I'm for now using [1] as context. 0002-hj-add-dtrace-probes-v5.patch Hm. I'm personally very unenthusiastic about addming more of these, and would rather rip all of them out. I tend to believe that static problems simply aren't a good approach for anything requiring a lot of detail. But whatever. 0003-hj-refactor-memory-accounting-v5.patch @@ -424,15 +422,29 @@ ExecChooseHashTableSize(double ntuples, int tupwidth, bool useskew, if (ntuples <= 0.0) ntuples = 1000.0; - /* -* Estimate tupsize based on footprint of tuple in hashtable... note this -* does not allow for any palloc overhead. The manipulations of spaceUsed -* don't count palloc overhead either. -*/ + /* Estimate tupsize based on footprint of tuple in hashtable. */ palloc overhead is still unaccounted for, no? In the chunked case that might not be much, I realize that (so that comment should probably have been updated when chunking was introduced). - SizespaceUsed; /* memory space currently used by tuples */ + SizespaceUsed; /* memory space currently used by hashtable */ It's not really the full hashtable, is it? The ->buckets array appears to still be unaccounted for. Looks ok. 0004-hj-refactor-batch-increases-v5.patch @@ -1693,10 +1689,12 @@ ExecHashRemoveNextSkewBucket(HashJoinTable hashtable) } /* - * Allocate 'size' bytes from the currently active HashMemoryChunk + * Allocate 'size' bytes from the currently active HashMemoryChunk. 
If + * 'respect_work_mem' is true, this may cause the number of batches to be + * increased in an attempt to shrink the hash table. */ static void * -dense_alloc(HashJoinTable hashtable, Size size) +dense_alloc(HashJoinTable hashtable, Size size, bool respect_work_mem) { HashMemoryChunk newChunk; char *ptr; @@ -1710,6 +1708,15 @@ dense_alloc(HashJoinTable hashtable, Size size) */ if (size > HASH_CHUNK_THRESHOLD) { + if (respect_work_mem && + hashtable->growEnabled && + hashtable->spaceUsed + HASH_CHUNK_HEADER_SIZE + size > + hashtable->spaceAllowed) + { + /* work_mem would be exceeded: try to shrink hash table */ + ExecHashIncreaseNumBatches(hashtable); + } + Isn't it kinda weird to do this from within dense_alloc()? I mean that dumps a lot of data to disk, frees a bunch of memory and so on - not exactly what "dense_alloc" implies. Isn't the free()ing part also dangerous, because the caller might actually use some of that memory, like e.g. in ExecHashRemoveNextSkewBucket() or such. I haven't looked deeply enough to check whether that's an active bug, but it seems like inviting one if not. 0005-hj-refactor-unmatched-v5.patch I'm a bit confused as to why unmatched tuple scan is a good parallelism target, but I might see later... 0006-hj-barrier-v5.patch Skipping that here. 0007-hj-exec-detach-node-v5.patch Hm. You write elsewhere: > By the time ExecEndNode() runs in workers, ExecShutdownNode() has > already run. That's done on purpose because, for example, the hash > table needs to survive longer than the parallel environment to allow > EXPLAIN to peek at it. But it means that the Gather node has thrown > out the shared memory before any parallel-aware node below it gets to > run its Shutdown and End methods. So I invented ExecDetachNode() > which runs before ExecShutdownNode(), giving parallel-aware nodes a > chance to say goodbye before their shared memory vanishes. Better > ideas? 
To me that is a weakness in the ExecShutdownNode() API - imo child nodes should get the chance to shutdown before the upper-level node. ExecInitNode/ExecEndNode etc give individual nodes the freedom to do things in the right order, but ExecShutdownNode() doesn't. I don't quite see why we'd want to invent a separate ExecDetachNode() that'd be called immediately before ExecShutdownNode(). An easy way to change that would be to return in the ExecShutdownNode()'s T_GatherState case, and delegate the responsibility of calling it on Gather's children to ExecShutdownGather(). Alternatively we could make it a full-blown thing like ExecInitNode() that every node needs to implement, but that seems a
Re: [HACKERS] WIP: [[Parallel] Shared] Hash
Out of archeological curiosity, I was digging around in the hash join code and RCS history from Postgres 4.2[1], and I was astounded to discover that it had a parallel executor for Sequent SMP systems and was capable of parallel hash joins as of 1991. At first glance, it seems to follow approximately the same design as I propose: share a hash table and use a barrier to coordinate the switch from build phase to probe phase and to deal with later batches. It uses mmap to get space and then works with relative pointers. See src/backend/executor/n_hash.c and src/backend/executor/n_hashjoin.c. Some of this might be described in Wei Hong's PhD thesis[2] which I haven't had the pleasure of reading yet. The parallel support is absent from the first commit in our repo (1996), but there are some vestiges like RelativeAddr and ABSADDR used to access the hash table (presumably needlessly) and also some mentions of parallel machines in comments that survived up until commit 26069a58 (1999). [1] http://db.cs.berkeley.edu/postgres.html [2] http://db.cs.berkeley.edu/papers/ERL-M93-28.pdf -- Thomas Munro http://www.enterprisedb.com -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] WIP: [[Parallel] Shared] Hash
On Thu, Feb 9, 2017 at 2:03 AM, Ashutosh Bapat wrote: >> >> 0004-hj-refactor-batch-increases-v4.patch: >> >> Modify the existing hash join code to detect work_mem exhaustion at >> the point where chunks are allocated, instead of checking after every >> tuple insertion. This matches the logic used for estimating, and more >> importantly allows for some parallelism in later patches. > > The patch has three changes > 1. change dense_alloc() to accept respect_workmem argument and use it > within the function. > 2. Move call to ExecHashIncreaseNumBatches() into dense_alloc() from > ExecHashTableInsert() to account for memory before inserting new tuple > 3. Check growEnabled before calling ExecHashIncreaseNumBatches(). Thanks for the review! > I think checking growEnabled within ExecHashIncreaseNumBatches() is > easier to maintain than checking at every caller. If someone is to > add a caller tomorrow, s/he has to remember to add the check. Hmm. Yeah. In the later 0010 patch ExecHashIncreaseNumBatches will be used in a slightly different way -- not for making decisions or performing the hash table shrink, but only for reallocating the batch arrays. I will see if putting the growEnabled check back in there in the 0004 patch and then refactoring in a later patch makes more sense to someone reviewing the patches independently, for the next version. > It might be better to add some comments in > ExecHashRemoveNextSkewBucket() explaining why dense_alloc() should be > called with respect_work_mem = false? ExecHashSkewTableInsert() does > call ExecHashIncreaseNumBatches() after calling > ExecHashRemoveNextSkewBucket() multiple times, so it looks like we do > expect increase in space used and thus go beyond work_mem for a short > while. Is there a way we can handle this case in dense_alloc()? Right, that needs some explanation, which I'll add for the next version.
The explanation is that while 'shrinking' the hash table, we may need to go over the work_mem limit by one chunk for a short time. That is already true in master, but by moving the work_mem checks into dense_alloc I ran into the problem that dense_alloc might decide to shrink the hash table which needs to call dense alloc. Shrinking works by spinning through all the chunks copying only the tuples we want to keep into new chunks and freeing the old chunks as we go. We will temporarily go one chunk over work_mem when we allocate the first new chunk but before we've freed the first old one. We don't want shrink operations to trigger recursive shrink operations, so we disable respect for work_mem when calling it from ExecHashIncreaseNumBatches. In the course of regular hash table loading, we want to respect work_mem. Looking at the v5 patch series I posted yesterday, I see that in fact ExecHashIncreaseNumBatches calls dense_alloc with respect_work_mem = true in the 0004 patch, and then I corrected that mistake in the 0008 patch; I'll move the correction back to the 0004 patch in the next version. To reach ExecHashIncreaseNumBatches, see the "ugly" query in hj-test-queries.sql (posted with v5). In ExecHashRemoveNextSkewBucket I preserved the existing behaviour of not caring about work_mem when performing the rare operation of copying a tuple from the skew bucket into a dense_alloc memory chunk so it can be inserted into a regular (non-skew) bucket. > Is it possible that increasing the number of batches changes the > bucket number of the tuple being inserted? If so, should we > recalculate the bucket and batch of the tuple being inserted? No -- see the function documentation for ExecHashGetBucketAndBatch. -- Thomas Munro http://www.enterprisedb.com -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] WIP: [[Parallel] Shared] Hash
On Thu, Feb 2, 2017 at 4:57 PM, Rafia Sabihwrote: > On Thu, Feb 2, 2017 at 1:19 AM, Thomas Munro > wrote: >> On Thu, Feb 2, 2017 at 3:34 AM, Rafia Sabih >> wrote: >>> [ regressions ] >> >> Thanks Rafia. At first glance this plan is using the Parallel Shared >> Hash in one place where it should pay off, that is loading the orders >> table, but the numbers are terrible. I noticed that it uses batch >> files and then has to increase the number of batch files, generating a >> bunch of extra work, even though it apparently overestimated the >> number of rows, though that's only ~9 seconds of ~60. I am >> investigating. > > Hi Thomas, > Apart from the previously reported regression, there appear one more > issue in this set of patches. At times, running a query using parallel > hash it hangs up and all the workers including the master shows the > following backtrace, Here's a new version to fix the problems reported by Rafia above. The patch descriptions are as before but it starts from 0002 because 0001 was committed as 7c5d8c16 (thanks, Andres). First, some quick master-vs-patch numbers from the queries listed with regressions, using TPCH dbgen scale 10, work_mem = 64MB, max_parallel_workers_per_gather = 4, shared_buffers = 8GB (the numbers themselves not comparable as different scale and different hardware). Better except for Q5 and Q8, which for some mysterious reason plans only one worker and then loses. I'm looking into that. Q3 19917.682 -> 8649.822 Q5 4149.983 -> 4192.551 Q7 14453.721 -> 10303.911 Q8 1981.540 -> 8030.264 Q9 26928.102 -> 17384.607 Q10 16955.240 -> 14563.787 I plan to explore the performance space with a range of worker numbers and work_mem sizes and do some analysis; more soon. Changes: 1. Fixed two bugs that resulted in ExecHashShrink sometimes hanging, as reported by Rafia. 
(1) When splitting the large v3 patch up into smaller patches for v4, I'd managed to lose the line that initialises shared->shrink_barrier, causing some occasional strange behaviour. (2) I found a bug[1] in condition_variable.c that could cause hangs and fixed that via a separate patch and the fix was committed as 3f3d60d3 (thanks, Robert). 2. Simplified barrier.c by removing BarrierWaitSet(), because that turned out to be unnecessary to implement rescan as I'd originally thought, and was incompatible with the way BarrierDetach() works. The latter assumes that the phase only ever increments, so that combination of features was broken. 3. Sorted out the hash table sizing logic that was previously leading to some strange decisions about batches. This involved putting the total estimated number of inner rows into the path and plan when there is a partial inner plan, because plan_rows only has the partial number. I need to size the hash table correctly at execution time. It seems a bit strange to do that specifically and only for Hash (see rows_total in the 0008 patch)... should there be some more generic way? Should total rows go into Plan rather than HashPlan, or perhaps the parallel divisor should go somewhere? 4. Comments fixed and added based on Ashutosh's feedback on patch 0003. 5. Various small bug fixes. I've also attached a small set of test queries that hit the four "modes" (for want of a better word) of our hash join algorithm for dealing with different memory conditions, which I've nicknamed thus: 1. "Good": We estimate that the hash table will fit in work_mem, and at execution time it does. This patch makes that more likely because [Parallel] Shared Hash gets to use more work_mem as discussed. 2. "Bad": We estimate that the hash table won't fit in work_mem, but that if we partition it into N batches using some bits from the hash value then each batch will fit in work_mem. At execution time, each batch does indeed fit into work_mem. 
This is not ideal, because we have to write out and read back in (N - 1) / N of the inner and outer tuples (ie all batches except the first one, although actually costsize.c always charges for all of them). But it may still be better than other plans, and the IO is sequential. Currently Shared Hash shouldn't be selected over (private) Hash if it would require batching anyway due to the cpu_shared_tuple_cost tie-breaker: on the one hand it avoids a bunch of copies of the batch files being written out, but on the other it introduces a bunch of synchronisation overhead. Parallel Shared Hash is fairly likely to be chosen if possible due to division of the inner relation's cost outweighing cpu_shared_tuple_cost. 3. "Ugly": We planned for "good" or "bad" mode, but we ran out of work_mem at some point during execution: this could be during the initial hash table load, or while loading a subsequent batch. So now we double the number of batches, splitting the current batch and all batches that haven't been processed yet into two in the hope of shrinking the
Re: [HACKERS] WIP: [[Parallel] Shared] Hash
> > 0004-hj-refactor-batch-increases-v4.patch: > > Modify the existing hash join code to detect work_mem exhaustion at > the point where chunks are allocated, instead of checking after every > tuple insertion. This matches the logic used for estimating, and more > importantly allows for some parallelism in later patches. The patch has three changes 1. change dense_alloc() to accept respect_workmem argument and use it within the function. 2. Move call to ExecHashIncreaseNumBatches() into dense_alloc() from ExecHashTableInsert() to account for memory before inserting new tuple 3. Check growEnabled before calling ExecHashIncreaseNumBatches(). I think checking growEnabled within ExecHashIncreaseNumBatches() is easier to maintain than checking at every caller. If someone is to add a caller tomorrow, s/he has to remember to add the check. It might be better to add some comments in ExecHashRemoveNextSkewBucket() explaining why dense_alloc() should be called with respect_work_mem = false? ExecHashSkewTableInsert() does call ExecHashIncreaseNumBatches() after calling ExecHashRemoveNextSkewBucket() multiple times, so it looks like we do expect increase in space used and thus go beyond work_mem for a short while. Is there a way we can handle this case in dense_alloc()? Is it possible that increasing the number of batches changes the bucket number of the tuple being inserted? If so, should we recalculate the bucket and batch of the tuple being inserted? -- Best Wishes, Ashutosh Bapat EnterpriseDB Corporation The Postgres Database Company -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] WIP: [[Parallel] Shared] Hash
On Thu, Feb 2, 2017 at 4:57 PM, Rafia Sabih wrote: > Apart from the previously reported regression, there appear one more > issue in this set of patches. At times, running a query using parallel > hash it hangs up and all the workers including the master shows the > following backtrace, Ugh, thanks. Investigating. -- Thomas Munro http://www.enterprisedb.com -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] WIP: [[Parallel] Shared] Hash
On Thu, Feb 2, 2017 at 1:19 AM, Thomas Munro wrote: > On Thu, Feb 2, 2017 at 3:34 AM, Rafia Sabih > wrote: >> 9 | 62928.88 | 59077.909 > > Thanks Rafia. At first glance this plan is using the Parallel Shared > Hash in one place where it should pay off, that is loading the orders > table, but the numbers are terrible. I noticed that it uses batch > files and then has to increase the number of batch files, generating a > bunch of extra work, even though it apparently overestimated the > number of rows, though that's only ~9 seconds of ~60. I am > investigating.

Hi Thomas,

Apart from the previously reported regression, there appears to be one more issue in this set of patches. At times, a query using parallel hash hangs, and all the workers, including the master, show the following backtrace:

#0  0x3fff880c7de8 in __epoll_wait_nocancel () from /lib64/power8/libc.so.6
#1  0x104e2718 in WaitEventSetWaitBlock (set=0x100157bde90, cur_timeout=-1, occurred_events=0x3fffdbe69698, nevents=1) at latch.c:998
#2  0x104e255c in WaitEventSetWait (set=0x100157bde90, timeout=-1, occurred_events=0x3fffdbe69698, nevents=1, wait_event_info=134217745) at latch.c:950
#3  0x10512970 in ConditionVariableSleep (cv=0x3ffd736e05a4, wait_event_info=134217745) at condition_variable.c:132
#4  0x104dbb1c in BarrierWaitSet (barrier=0x3ffd736e0594, new_phase=1, wait_event_info=134217745) at barrier.c:97
#5  0x104dbb9c in BarrierWait (barrier=0x3ffd736e0594, wait_event_info=134217745) at barrier.c:127
#6  0x103296a8 in ExecHashShrink (hashtable=0x3ffd73747dc0) at nodeHash.c:1075
#7  0x1032c46c in dense_alloc_shared (hashtable=0x3ffd73747dc0, size=40, shared=0x3fffdbe69eb8, respect_work_mem=1 '\001') at nodeHash.c:2618
#8  0x1032a2f0 in ExecHashTableInsert (hashtable=0x3ffd73747dc0, slot=0x100158f9e90, hashvalue=2389907270) at nodeHash.c:1476
#9  0x10327fd0 in MultiExecHash (node=0x100158f9800) at nodeHash.c:296
#10 0x10306730 in MultiExecProcNode (node=0x100158f9800) at execProcnode.c:577

The issue is neither deterministic nor straightforwardly reproducible: sometimes after make clean, etc., queries run, and sometimes they hang again. I wanted to bring this to your notice, hoping you might be faster than me in picking up the exact reason behind this anomaly. -- Regards, Rafia Sabih EnterpriseDB: http://www.enterprisedb.com/ -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] WIP: [[Parallel] Shared] Hash
On Thu, Feb 2, 2017 at 3:34 AM, Rafia Sabih wrote: > 9 | 62928.88 | 59077.909 Thanks Rafia. At first glance this plan is using the Parallel Shared Hash in one place where it should pay off, that is loading the orders table, but the numbers are terrible. I noticed that it uses batch files and then has to increase the number of batch files, generating a bunch of extra work, even though it apparently overestimated the number of rows, though that's only ~9 seconds of ~60. I am investigating. -- Thomas Munro http://www.enterprisedb.com
Re: [HACKERS] WIP: [[Parallel] Shared] Hash
> >> However, ExecHashIncreaseNumBatches() may change the >> number of buckets; the patch does not seem to account for spaceUsed changes >> because of that. > > That's what this hunk is intended to do: > > @@ -795,6 +808,12 @@ ExecHashIncreaseNumBuckets(HashJoinTable hashtable) > TRACE_POSTGRESQL_HASH_INCREASE_BUCKETS(hashtable->nbuckets, > > hashtable->nbuckets_optimal); > > + /* account for the increase in space that will be used by buckets */ > + hashtable->spaceUsed += sizeof(HashJoinTuple) * > + (hashtable->nbuckets_optimal - hashtable->nbuckets); > + if (hashtable->spaceUsed > hashtable->spacePeak) > + hashtable->spacePeak = hashtable->spaceUsed; > + Sorry, I missed that hunk. You are right, it's getting accounted for. >> >> In ExecHashTableReset(), do we want to update spacePeak while setting >> spaceUsed. > > I figured there was no way that the new spaceUsed value could be > bigger than spacePeak, because we threw out all chunks and have just > the bucket array, and we had that number of buckets before, so > spacePeak must at least have been set to a number >= this number > either when we expanded to this many buckets, or when we created the > hashtable in the first place. Perhaps I should > Assert(hashtable->spaceUsed <= hashtable->spacePeak). That would help, better if you explain this with a comment before Assert. > >> While this patch tracks space usage more accurately, I am afraid we might be >> overdoing it; a reason why we don't track space usage accurately now. But I >> think I will leave it to be judged by someone who is more familiar with the >> code and possibly has historical perspective. > > Well it's not doing more work; it doesn't make any practical > difference whatsoever but it's technically doing less work than > master, by doing memory accounting only when acquiring a new 32KB > chunk. This patch does more work while counting the space used by buckets, I guess. AFAIU, right now, that happens only after the hash table is built completely. 
But that's fine. I am not worried about whether it's less work or more. > But if by overdoing it you mean that no one really cares about > the tiny increase in accuracy so the patch on its own is a bit of a > waste of time, you're probably right. This is what I meant by overdoing; you have spelled it better. > Depending on tuple size, you > could imagine something like 64 bytes of header and unused space per > 32KB chunk that we're not accounting for, and that's only 0.2%. So I > probably wouldn't propose this refactoring just on accuracy grounds > alone. > > This refactoring is intended to pave the way for shared memory > accounting in the later patches, which would otherwise generate ugly > IPC if done for every time a tuple is allocated. I considered using > atomic add to count space per tuple, or maintaining per-backend local > subtotals and periodically summing. Then I realised that switching to > per-chunk accounting would fix the IPC problem AND be justifiable on > theoretical grounds. When we allocate a new 32KB chunk, we really are > using 32KB more of your memory. +1. Thanks for considering the comments. -- Best Wishes, Ashutosh Bapat EnterpriseDB Corporation The Postgres Database Company
Re: [HACKERS] WIP: [[Parallel] Shared] Hash
On Wed, Feb 1, 2017 at 2:10 AM, Ashutosh Bapat wrote: >> >> 0003-hj-refactor-memory-accounting-v4.patch: >> [...] >> > I looked at this patch. I agree that it accounts for memory usage more > accurately. Here are a few comments. Thanks for the review! > spaceUsed is defined with comment > Size spaceUsed; /* memory space currently used by tuples */ > > In ExecHashTableCreate(), although the space is allocated for buckets, no > tuples are yet inserted, so no space is used by the tuples, so going strictly > by the comment, spaceUsed should be 0 in that function. But I think the patch > is accounting the spaceUsed more accurately. Without this patch, the actual > allocation might cross spaceAllowed without being noticed. With this patch > that's not possible. Probably we should change the comment to say memory space > currently allocated. Right, that comment is out of date. It is now the space used by the bucket array and the tuples. I will fix that in the next version. > However, ExecHashIncreaseNumBatches() may change the > number of buckets; the patch does not seem to account for spaceUsed changes > because of that. That's what this hunk is intended to do:

@@ -795,6 +808,12 @@ ExecHashIncreaseNumBuckets(HashJoinTable hashtable)
 	TRACE_POSTGRESQL_HASH_INCREASE_BUCKETS(hashtable->nbuckets,
 										   hashtable->nbuckets_optimal);
 
+	/* account for the increase in space that will be used by buckets */
+	hashtable->spaceUsed += sizeof(HashJoinTuple) *
+		(hashtable->nbuckets_optimal - hashtable->nbuckets);
+	if (hashtable->spaceUsed > hashtable->spacePeak)
+		hashtable->spacePeak = hashtable->spaceUsed;
+
 	hashtable->nbuckets = hashtable->nbuckets_optimal;
 	hashtable->log2_nbuckets = hashtable->log2_nbuckets_optimal;

It knows that spaceUsed already includes the old bucket array (nbuckets), so it figures out how much bigger the new bucket array will be (nbuckets_optimal - nbuckets) and adds that.
> Without this patch ExecHashTableInsert() used to account for the space used by > a single tuple inserted. The patch moves this calculation in dense_alloc() and > accounts for out-of-bound allocation for larger tuples. That's good. > > The change in ExecChooseHashTableSize() too looks fine. > > In ExecHashTableReset(), do we want to update spacePeak while setting > spaceUsed. I figured there was no way that the new spaceUsed value could be bigger than spacePeak, because we threw out all chunks and have just the bucket array, and we had that number of buckets before, so spacePeak must at least have been set to a number >= this number either when we expanded to this many buckets, or when we created the hashtable in the first place. Perhaps I should Assert(hashtable->spaceUsed <= hashtable->spacePeak). > While this patch tracks space usage more accurately, I am afraid we might be > overdoing it; a reason why we don't track space usage accurately now. But I > think I will leave it to be judged by someone who is more familiar with the > code and possibly has historical perspective. Well it's not doing more work; it doesn't make any practical difference whatsoever but it's technically doing less work than master, by doing memory accounting only when acquiring a new 32KB chunk. But if by overdoing it you mean that no one really cares about the tiny increase in accuracy so the patch on its own is a bit of a waste of time, you're probably right. Depending on tuple size, you could imagine something like 64 bytes of header and unused space per 32KB chunk that we're not accounting for, and that's only 0.2%. So I probably wouldn't propose this refactoring just on accuracy grounds alone. This refactoring is intended to pave the way for shared memory accounting in the later patches, which would otherwise generate ugly IPC if done for every time a tuple is allocated. 
I considered using atomic add to count space per tuple, or maintaining per-backend local subtotals and periodically summing. Then I realised that switching to per-chunk accounting would fix the IPC problem AND be justifiable on theoretical grounds. When we allocate a new 32KB chunk, we really are using 32KB more of your memory. -- Thomas Munro http://www.enterprisedb.com
Re: [HACKERS] WIP: [[Parallel] Shared] Hash
> > 0003-hj-refactor-memory-accounting-v4.patch: > > Modify the existing hash join code to work in terms of chunks when > estimating and later tracking memory usage. This is probably more > accurate than the current tuple-based approach, because it tries to > take into account the space used by chunk headers and the wasted space > in chunks. In practice the difference is probably small, but it's > arguably more accurate; I did this because I need chunk-based > accounting in the later patches. Also, make HASH_CHUNK_SIZE the actual > size of allocated chunks (ie the header information is included in > that size so we allocate exactly 32KB, not 32KB + a bit, for the > benefit of the dsa allocator which otherwise finishes up allocating > 36KB). > I looked at this patch. I agree that it accounts for memory usage more accurately. Here are a few comments. spaceUsed is defined with comment Size spaceUsed; /* memory space currently used by tuples */ In ExecHashTableCreate(), although the space is allocated for buckets, no tuples are yet inserted, so no space is used by the tuples, so going strictly by the comment, spaceUsed should be 0 in that function. But I think the patch is accounting the spaceUsed more accurately. Without this patch, the actual allocation might cross spaceAllowed without being noticed. With this patch that's not possible. Probably we should change the comment to say memory space currently allocated. However, ExecHashIncreaseNumBatches() may change the number of buckets; the patch does not seem to account for spaceUsed changes because of that. Without this patch ExecHashTableInsert() used to account for the space used by a single tuple inserted. The patch moves this calculation into dense_alloc() and accounts for out-of-bound allocation for larger tuples. That's good. The change in ExecChooseHashTableSize() too looks fine. In ExecHashTableReset(), do we want to update spacePeak while setting spaceUsed.
While this patch tracks space usage more accurately, I am afraid we might be overdoing it; perhaps that is a reason why we don't track space usage accurately now. But I think I will leave it to be judged by someone who is more familiar with the code and possibly has historical perspective. -- Best Wishes, Ashutosh Bapat EnterpriseDB Corporation The Postgres Database Company
Re: [HACKERS] WIP: [[Parallel] Shared] Hash
On Sat, Jan 28, 2017 at 10:03 AM, Thomas Munro wrote: > I have broken this up into a patch series, harmonised the private vs > shared hash table code paths better and fixed many things including > the problems with rescans and regression tests mentioned upthread. > You'll see that one of the patches is that throwaway BufFile > import/export facility, which I'll replace with your code as > discussed. Patch moved to CF 2017-03. -- Michael
Re: [HACKERS] WIP: [[Parallel] Shared] Hash
On Sat, Jan 7, 2017 at 9:01 AM, Thomas Munro wrote: > Stepping back a bit, I am aware of the following approaches to hash > join parallelism: > > 1. Run the inner plan and build a private hash table in each > participant [...]. > > 2. Run a partition-wise hash join[1]. [...] > > 3. Repartition the data on the fly, and then run a partition-wise > hash join. [...] > > 4. Scatter both the inner and outer plans arbitrarily across > participants [...], and build a shared hash > table. [...] > > [...] I suspect that 4 is probably a better > fit than 3 for Postgres today, because the communication overhead of > shovelling nearly all tuples through extra tuple queues to route them > to the right hash table would surely be very high, though I can see > that it's very attractive to have a reusable tuple repartitioning > operator and then run k disjoint communication-free joins (again, > without code change to the join operator, and to the benefit of all > join operators). On this topic, I recently stumbled on the 2011 paper "Design and Evaluation of Main Memory Hash Join Algorithms for Multi-core CPUs"[1] and found it reassuring. It compares simple shared hash tables to some state-of-the-art repartitioning approaches (including the radix join algorithm which performs the amazing feat of building a lot of cacheline-sized hash tables and then runs with minimal cache misses). From the introduction: "Second, we show that an algorithm that does not do any partitioning, but simply constructs a single shared hash table on the build relation often outperforms more complex algorithms. This simple “no-partitioning” hash join algorithm is robust to sub-optimal parameter choices by the optimizer, and does not require any knowledge of the characteristics of the input to work well.
To the best of our knowledge, this simple hash join technique differs from what is currently implemented in existing DBMSs for multi-core hash join processing, and offers a tantalizingly simple, efficient, and robust technique for implementing the hash join operation." "Finally, we show that the simple “no-partitioning” hash join algorithm takes advantage of intrinsic hardware optimizations to handle skew. As a result, this simple hash join technique often benefits from skew and its relative performance increases as the skew increases! This property is a big advancement over the state-of-the-art methods, as it is important to have methods that can gracefully handle skew in practice [8]." (That relates to SMT pipelining compensating for the extra cacheline misses during probing by doing thread A's work while waiting for thread B's memory to be fetched.) From the conclusion: "... Our results show that a simple hash join technique that does not do any partitioning of the input relations often outperforms the other more complex partitioning-based join alternatives. In addition, the relative performance of this simple hash join technique rapidly improves with increasing skew, and it outperforms every other algorithm in the presence of even small amounts of skew." For balance, the authors of a 2013 paper "Main-Memory Hash Joins on Multi-Core CPUs: Tuning to the Underlying Hardware"[2] are less keen on the simple "hardware-oblivious" "no partitioning" approach and don't buy the other paper's ideas about SMT. Incidentally, their results on the benefits of large (huge) pages are interesting (table VI) and suggest that huge page support for DSM segments could be good here. 
[1] https://pdfs.semanticscholar.org/9de4/b32f2c7b630a4f6aae6994a362a46c7c49e9.pdf [2] https://www.inf.ethz.ch/personal/cagri.balkesen/publications/parallel-joins-icde13.pdf -- Thomas Munro http://www.enterprisedb.com
Re: [HACKERS] WIP: [[Parallel] Shared] Hash
Hi Thomas, On Fri, Jan 27, 2017 at 5:03 PM, Thomas Munro wrote: > I have broken this up into a patch series, harmonised the private vs > shared hash table code paths better and fixed many things including > the problems with rescans and regression tests mentioned upthread. > You'll see that one of the patches is that throwaway BufFile > import/export facility, which I'll replace with your code as > discussed. I'll try to get back to this ASAP, but expect to be somewhat busy next week. Next week will be my last week at Heroku. It was not an easy decision for me to leave Heroku, but I felt it was time for a change. I am very grateful to have had the opportunity. I have learned an awful lot during my time at the company. It has been excellent to have an employer that has been so supportive of my work on Postgres this whole time. -- Peter Geoghegan
Re: [HACKERS] WIP: [[Parallel] Shared] Hash
On Fri, Jan 13, 2017 at 2:36 PM, Peter Geoghegan wrote: > [...] > Yeah. That's basically what the BufFile unification process can > provide you with (or will, once I get around to implementing the > refcount thing, which shouldn't be too hard). As already noted, I'll > also want to make it defer creation of a leader-owned segment, unless > and until that proves necessary, which it never will for hash join. Hi Peter, I have broken this up into a patch series, harmonised the private vs shared hash table code paths better and fixed many things including the problems with rescans and regression tests mentioned upthread. You'll see that one of the patches is that throwaway BufFile import/export facility, which I'll replace with your code as discussed. The three 'refactor' patches change the existing hash join code to work in terms of chunks in more places. These may be improvements in their own right, but mainly they pave the way for parallelism. The later patches introduce single-batch and then multi-batch shared tables. The patches in the attached tarball are:

0001-nail-down-regression-test-row-order-v4.patch: A couple of regression tests would fail with later refactoring that changes the order of unmatched rows emitted by hash joins. So first, let's fix that by adding ORDER BY in those places, without any code changes.

0002-hj-add-dtrace-probes-v4.patch: Before making any code changes, let's add some dtrace probes so that we can measure time spent doing different phases of hash join work before and after the later changes. The main problem with the probes as I have them here (and the extra probes inserted by later patches in the series) is that interesting query plans contain multiple hash joins so these get all mixed up when you're trying to measure stuff, so perhaps I should pass executor node IDs into all the probes. More on this later.
(If people don't want dtrace probes in the executor, I'm happy to omit them and maintain that kind of thing locally for my own testing purposes...)

0003-hj-refactor-memory-accounting-v4.patch: Modify the existing hash join code to work in terms of chunks when estimating and later tracking memory usage. This is probably more accurate than the current tuple-based approach, because it tries to take into account the space used by chunk headers and the wasted space in chunks. In practice the difference is probably small, but it's arguably more accurate; I did this because I need chunk-based accounting in the later patches. Also, make HASH_CHUNK_SIZE the actual size of allocated chunks (ie the header information is included in that size so we allocate exactly 32KB, not 32KB + a bit, for the benefit of the dsa allocator which otherwise finishes up allocating 36KB).

0004-hj-refactor-batch-increases-v4.patch: Modify the existing hash join code to detect work_mem exhaustion at the point where chunks are allocated, instead of checking after every tuple insertion. This matches the logic used for estimating, and more importantly allows for some parallelism in later patches.

0005-hj-refactor-unmatched-v4.patch: Modifies the existing hash join code to handle unmatched tuples in right/full joins chunk-by-chunk. This is probably a cache-friendlier scan order anyway, but the real goal is to provide a natural grain for parallelism in a later patch.

0006-hj-barrier-v4.patch: The patch from a nearby thread previously presented as a dependency of this project. It might as well be considered part of this patch series.

0007-hj-exec-detach-node-v4.patch: By the time ExecEndNode() runs in workers, ExecShutdownNode() has already run. That's done on purpose because, for example, the hash table needs to survive longer than the parallel environment to allow EXPLAIN to peek at it.
But it means that the Gather node has thrown out the shared memory before any parallel-aware node below it gets to run its Shutdown and End methods. So I invented ExecDetachNode() which runs before ExecShutdownNode(), giving parallel-aware nodes a chance to say goodbye before their shared memory vanishes. Better ideas? 0008-hj-shared-single-batch-v4.patch: Introduces hash joins with "Shared Hash" and "Parallel Shared Hash" nodes, for single-batch joins only. If the planner has a partial inner plan, it'll pick a Parallel Shared Hash plan to divide that over K participants. Failing that, if the planner has a parallel-safe inner plan and thinks that it can avoid batching by using work_mem * K memory (shared by all K participants), it will now use a Shared Hash. Otherwise it'll typically use a Hash plan as before. Without the later patches, it will blow through work_mem * K if it turns out to have underestimated the hash table size, because it lacks infrastructure for dealing with batches. The trickiest thing at this point in the series is that participants (workers and the leader) can show up at any time, so there are three places that provide synchronisation with a parallel
Re: [HACKERS] WIP: [[Parallel] Shared] Hash
On Wed, Jan 11, 2017 at 7:37 PM, Thomas Munro wrote: > Hmm. Yes, that is an entirely bogus use of isInterXact. I am > thinking about how to fix that with refcounts. Cool. As I said, the way I'd introduce refcounts would not be very different from what I've already done -- there'd still be a strong adherence to the use of resource managers to clean up, with that including exactly one particular backend doing the extra step of deletion. The refcount only changes which backend does that extra step in corner cases, which is conceptually a very minor change. >> I don't think it's right for buffile.c to know anything about file >> paths directly -- I'd say that that's a modularity violation. >> PathNameOpenFile() is called by very few callers at the moment, all of >> them very low level (e.g. md.c), but you're using it within buffile.c >> to open a path to the file that you obtain from shared memory > > Hmm. I'm not seeing the modularity violation. buffile.c uses > interfaces already exposed by fd.c to do this: OpenTemporaryFile, > then FilePathName to find the path, then PathNameOpenFile to open from > another process. I see that your approach instead has client code > provide more meta data so that things can be discovered, which may > well be a much better idea. Indeed, my point was that the metadata thing would IMV be better. buffile.c shouldn't need to know about file paths, etc. Instead, caller should pass BufFileImport()/BufFileUnify() simple private state sufficient for routine to discover all details itself, based on a deterministic scheme. In my tuplesort patch, that piece of state is:

+/*
+ * BufFileOp is an identifier for a particular parallel operation involving
+ * temporary files.  Parallel temp file operations must be discoverable across
+ * processes based on these details.
+ *
+ * These fields should be set by BufFileGetIdent() within leader process.
+ * Identifier BufFileOp makes temp files from workers discoverable within
+ * leader.
+ */
+typedef struct BufFileOp
+{
+	/*
+	 * leaderPid is leader process PID.
+	 *
+	 * tempFileIdent is an identifier for a particular temp file (or parallel
+	 * temp file op) for the leader.  Needed to distinguish multiple parallel
+	 * temp file operations within a given leader process.
+	 */
+	int		leaderPid;
+	long	tempFileIdent;
+} BufFileOp;
+

> Right, that is a problem. A refcount mode could fix that; virtual > file descriptors would be closed in every backend using the current > resource owner, and the files would be deleted when the last one turns > out the lights. Yeah. That's basically what the BufFile unification process can provide you with (or will, once I get around to implementing the refcount thing, which shouldn't be too hard). As already noted, I'll also want to make it defer creation of a leader-owned segment, unless and until that proves necessary, which it never will for hash join. Perhaps I should make superficial changes to unification in my patch to suit your work, like rename the field BufFileOp.leaderPid to BufFileOp.ownerPid, without actually changing any behaviors, except as noted in the last paragraph. Since you only require that backends be able to open up some other backend's temp file themselves for a short while, that gives you everything you need. You'll be doing unification in backends, and not just within the leader as in the tuplesort patch, I believe, but that's just fine. All that matters is that you present all data at once to a consuming backend via unification (since you treat temp file contents as immutable, this will be true for hash join, just as it is for tuplesort). There is a good argument against my making such a tweak, however, which is that maybe it's clearer to DBAs what's going on if temp file names have the leader PID in them for all operations.
So, maybe BufFileOp.leaderPid isn't renamed to BufFileOp.ownerPid by me; instead, you always make it the leader pid, while at the same time having the leader dole out BufFileOp.tempFileIdent identifiers to each worker as needed (see how I generate BufFileOps for an idea of what I mean if it's not immediately clear). That's also an easy change, or at least will be once the refcount thing is added. -- Peter Geoghegan
Re: [HACKERS] WIP: [[Parallel] Shared] Hash
On Thu, Jan 12, 2017 at 9:07 AM, Thomas Munro wrote: > On Wed, Jan 11, 2017 at 2:56 PM, Peter Geoghegan wrote: > > On Fri, Jan 6, 2017 at 12:01 PM, Thomas Munro > > wrote: > >> Here is a new WIP patch. I have plenty of things to tidy up (see note > >> at end), but the main ideas are now pretty clear and I'd appreciate > >> some feedback. > > > > I have some review feedback for your V3. I've chosen to start with the > > buffile.c stuff, since of course it might share something with my > > parallel tuplesort patch. This isn't comprehensive, but I will have > > more comprehensive feedback soon. > > Thanks! > > > I'm not surprised that you've generally chosen to make shared BufFile > > management as simple as possible, with no special infrastructure other > > than the ability to hold open other backend temp files concurrently > > within a worker, and no writing to another worker's temp file, or > > shared read pointer. As you put it, everything is immutable. I > > couldn't see much opportunity for adding a lot of infrastructure that > > wasn't written explicitly as parallel hash join code/infrastructure. > > My sense is that that was a good decision. I doubted that you'd ever > > want some advanced, generic shared BufFile thing with multiple read > > pointers, built-in cache coherency, etc. (Robert seemed to think that > > you'd go that way, though.) > > Right, this is extremely minimalist infrastructure. fd.c is > unchanged. buffile.c only gains the power to export/import read-only > views of BufFiles. There is no 'unification' of BufFiles: each hash > join participant simply reads from the buffile it wrote, and then > imports and reads from its peers' BufFiles, until all are exhausted; > so the 'unification' is happening in caller code which knows about the > set of participants and manages shared read positions. Clearly there > are some ownership/cleanup issues to straighten out, but I think those > problems are fixable (probably involving refcounts).
> > I'm entirely willing to throw that away and use the unified BufFile > concept, if it can be extended to support multiple readers of the > data, where every participant unifies the set of files. I have so far > assumed that it would be most efficient for each participant to read > from the file that it wrote before trying to read from files written > by other participants. I'm reading your patch now; more soon. > > > Anyway, some more specific observations: > > > > * ISTM that this is the wrong thing for shared BufFiles: > > > >> +BufFile * > >> +BufFileImport(BufFileDescriptor *descriptor) > >> +{ > > ... > >> + file->isInterXact = true; /* prevent cleanup by this backend */ > > > > There is only one user of isInterXact = true BufFiles at present, > > tuplestore.c. It, in turn, only does so for cases that require > > persistent tuple stores. A quick audit of these tuplestore.c callers > > show this to just be cursor support code within portalmem.c. Here is > > the relevant tuplestore_begin_heap() rule that that code adheres to, > > unlike the code I've quoted above: > > > > * interXact: if true, the files used for on-disk storage persist beyond > the > > * end of the current transaction. NOTE: It's the caller's > responsibility to > > * create such a tuplestore in a memory context and resource owner that > will > > * also survive transaction boundaries, and to ensure the tuplestore is > closed > > * when it's no longer wanted. > > Hmm. Yes, that is an entirely bogus use of isInterXact. I am > thinking about how to fix that with refcounts. > > > I don't think it's right for buffile.c to know anything about file > > paths directly -- I'd say that that's a modularity violation. > > PathNameOpenFile() is called by very few callers at the moment, all of > > them very low level (e.g. md.c), but you're using it within buffile.c > > to open a path to the file that you obtain from shared memory > > Hmm. I'm not seeing the modularity violation. 
buffile.c uses > interfaces already exposed by fd.c to do this: OpenTemporaryFile, > then FilePathName to find the path, then PathNameOpenFile to open from > another process. I see that your approach instead has client code > provide more meta data so that things can be discovered, which may > well be a much better idea. > > > directly. This is buggy because the following code won't be reached in > > workers that call your BufFileImport() function: > > > > /* Mark it for deletion at close */ > > VfdCache[file].fdstate |= FD_TEMPORARY; > > > > /* Register it with the current resource owner */ > > if (!interXact) > > { > > VfdCache[file].fdstate |= FD_XACT_TEMPORARY; > > > > ResourceOwnerEnlargeFiles(CurrentResourceOwner); > > ResourceOwnerRememberFile(CurrentResourceOwner, file); > > VfdCache[file].resowner = CurrentResourceOwner; > > > > /* ensure cleanup happens
Re: [HACKERS] WIP: [[Parallel] Shared] Hash
On Wed, Jan 11, 2017 at 2:56 PM, Peter Geoghegan wrote: > On Fri, Jan 6, 2017 at 12:01 PM, Thomas Munro > wrote: >> Here is a new WIP patch. I have plenty of things to tidy up (see note >> at end), but the main ideas are now pretty clear and I'd appreciate >> some feedback. > > I have some review feedback for your V3. I've chosen to start with the > buffile.c stuff, since of course it might share something with my > parallel tuplesort patch. This isn't comprehensive, but I will have > more comprehensive feedback soon. Thanks! > I'm not surprised that you've generally chosen to make shared BufFile > management as simple as possible, with no special infrastructure other > than the ability to hold open other backend temp files concurrently > within a worker, and no writing to another worker's temp file, or > shared read pointer. As you put it, everything is immutable. I > couldn't see much opportunity for adding a lot of infrastructure that > wasn't written explicitly as parallel hash join code/infrastructure. > My sense is that that was a good decision. I doubted that you'd ever > want some advanced, generic shared BufFile thing with multiple read > pointers, built-in cache coherency, etc. (Robert seemed to think that > you'd go that way, though.) Right, this is extremely minimalist infrastructure. fd.c is unchanged. buffile.c only gains the power to export/import read-only views of BufFiles. There is no 'unification' of BufFiles: each hash join participant simply reads from the buffile it wrote, and then imports and reads from its peers' BufFiles, until all are exhausted; so the 'unification' is happening in caller code which knows about the set of participants and manages shared read positions. Clearly there are some ownership/cleanup issues to straighten out, but I think those problems are fixable (probably involving refcounts).
I'm entirely willing to throw that away and use the unified BufFile concept, if it can be extended to support multiple readers of the data, where every participant unifies the set of files. I have so far assumed that it would be most efficient for each participant to read from the file that it wrote before trying to read from files written by other participants. I'm reading your patch now; more soon. > Anyway, some more specific observations: > > * ISTM that this is the wrong thing for shared BufFiles: > >> +BufFile * >> +BufFileImport(BufFileDescriptor *descriptor) >> +{ > ... >> + file->isInterXact = true; /* prevent cleanup by this backend */ > > There is only one user of isInterXact = true BufFiles at present, > tuplestore.c. It, in turn, only does so for cases that require > persistent tuple stores. A quick audit of these tuplestore.c callers > show this to just be cursor support code within portalmem.c. Here is > the relevant tuplestore_begin_heap() rule that that code adheres to, > unlike the code I've quoted above: > > * interXact: if true, the files used for on-disk storage persist beyond the > * end of the current transaction. NOTE: It's the caller's responsibility to > * create such a tuplestore in a memory context and resource owner that will > * also survive transaction boundaries, and to ensure the tuplestore is closed > * when it's no longer wanted. Hmm. Yes, that is an entirely bogus use of isInterXact. I am thinking about how to fix that with refcounts. > I don't think it's right for buffile.c to know anything about file > paths directly -- I'd say that that's a modularity violation. > PathNameOpenFile() is called by very few callers at the moment, all of > them very low level (e.g. md.c), but you're using it within buffile.c > to open a path to the file that you obtain from shared memory Hmm. I'm not seeing the modularity violation. 
buffile.c uses interfaces already exposed by fd.c to do this: OpenTemporaryFile, then FilePathName to find the path, then PathNameOpenFile to open from another process. I see that your approach instead has client code provide more metadata so that things can be discovered, which may well be a much better idea. > directly. This is buggy because the following code won't be reached in > workers that call your BufFileImport() function: > > /* Mark it for deletion at close */ > VfdCache[file].fdstate |= FD_TEMPORARY; > > /* Register it with the current resource owner */ > if (!interXact) > { > VfdCache[file].fdstate |= FD_XACT_TEMPORARY; > > ResourceOwnerEnlargeFiles(CurrentResourceOwner); > ResourceOwnerRememberFile(CurrentResourceOwner, file); > VfdCache[file].resowner = CurrentResourceOwner; > > /* ensure cleanup happens at eoxact */ > have_xact_temporary_files = true; > } Right, that is a problem. A refcount mode could fix that; virtual file descriptors would be closed in every backend using the current resource owner, and the files would be deleted when the last one turns out the lights.
Re: [HACKERS] WIP: [[Parallel] Shared] Hash
On Wed, Jan 11, 2017 at 12:05 PM, Robert Haas wrote: > On Wed, Jan 11, 2017 at 2:20 PM, Peter Geoghegan wrote: >> You'd probably still want to throw an error when workers ended up not >> deleting BufFile segments they owned, though, at least for parallel >> tuplesort. > > Don't see why. Simply because that's not expected as things stand -- why should the file go away in that context? (Admittedly, that doesn't seem like an excellent reason now.) I actually like the idea of a reference count, the more I think about it, since it doesn't actually have any tension with my original idea of ownership. If something like a randomAccess parallel tuplesort leader merge needs to write new segments (which it almost certainly *won't* anyway, due to my recent V7 changes), then it can still own those new segments itself, alone, and delete them on its own in the manner of conventional temp files, because we can still restrict the shared refcount mechanism to the deletion of "initial" segments. The refcount == 0 deleter only deletes those initial segments, and not any same-BufFile segments that might have been added (added to append to our unified BufFile within leader). I think that parallel hash join won't use this at all, and, as I said, it's only a theoretical requirement for parallel tuplesort, which will generally recycle blocks from worker temp files for its own writes all the time for randomAccess cases, the only cases that ever write within logtape.c. So, the only BufFile shared state needed, that must be maintained over time, is the refcount variable itself. The size of the "initial" BufFile (from which we derive number of new segments during unification) is passed, but it doesn't get maintained in shared memory. BufFile size remains a one-way, one-time message needed during unification. I only really need to tweak things in fd.c temp routines to make all this work. This is something I like because it makes certain theoretically useful things easier. 
Say, for example, we wanted to have tuplesort workers merge worker final materialized tapes (their final output), in order to arrange for the leader to have fewer than $NWORKER runs to merge at the end -- that's made easier by the refcount stuff. (I'm still not convinced that that's actually going to make CREATE INDEX faster. Still, it should, on general principle, be easy to write a patch that makes it happen -- a good overall design should leave things so that writing that prototype patch is easy.) >> This idea is something that's much more limited than the >> SharedTemporaryFile() API that you sketched on the parallel sort >> thread, because it only concerns resource management, and not how to >> make access to the shared file concurrency safe in any special, >> standard way. > > Actually, I only intended that sketch to be about resource management. > Sounds like I didn't explain very well. I'm glad to hear that, because I was very puzzled by what you said. I guess I was thrown off by "shared read pointers". I don't want to get into the business of flushing out dirty buffers, or making sure that every backend stays in lockstep about what the total size of the BufFile needs to be. It's so much simpler to just have clear "barriers" for each parallel operation, where backends present a large amount of immutable state to one other backend at the end, and tell it how big its BufFile is only once. (It's not quite immutable, since randomAccess recycle of temp files can happen within logtape.c, but the point is that there should be very little back and forth -- that needs to be severely restricted.) -- Peter Geoghegan
Re: [HACKERS] WIP: [[Parallel] Shared] Hash
On Wed, Jan 11, 2017 at 11:20 AM, Peter Geoghegan wrote: >> If multiple processes are using the same file via the BufFile >> interface, I think that it is absolutely necessary that there should >> be a provision to track the "attach count" of the BufFile. Each >> process that reaches EOXact decrements the attach count and when it >> reaches 0, the process that reduced it to 0 removes the BufFile. I >> think anything that's based on the notion that leaders will remove >> files and workers won't is going to be fragile and limiting, and I am >> going to push hard against any such proposal. > > Okay. My BufFile unification approach happens to assume that backends > clean up after themselves, but that isn't a rigid assumption (of > course, these are always temp files, so we reason about them as temp > files). Also, to be clear, and to avoid confusion: I don't think anyone wants an approach "that's based on the notion that leaders will remove files and workers won't". All that has been suggested is that the backend that creates the file should be responsible for deleting the file, by definition. And, that any other backend that may have files owned by another backend must be sure to not try to access them after the owner deletes them. (Typically, that would be ensured by some barrier condition, some dependency, inherent to how the parallel operation is implemented.) I will implement the reference count thing. -- Peter Geoghegan
Re: [HACKERS] WIP: [[Parallel] Shared] Hash
On Wed, Jan 11, 2017 at 2:20 PM, Peter Geoghegan wrote: > You'd probably still want to throw an error when workers ended up not > deleting BufFile segments they owned, though, at least for parallel > tuplesort. Don't see why. > This idea is something that's much more limited than the > SharedTemporaryFile() API that you sketched on the parallel sort > thread, because it only concerns resource management, and not how to > make access to the shared file concurrency safe in any special, > standard way. Actually, I only intended that sketch to be about resource management. Sounds like I didn't explain very well. > Instead, they should be passing around some kind of minimal > private-to-buffile state in shared memory that coordinates backends > participating in BufFile unification. Private state created by > buffile.c, and passed back to buffile.c. Everything should be > encapsulated within buffile.c, IMV, making parallel implementations as > close as possible to their serial implementations. That seems reasonable although I haven't studied the details carefully as yet. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Re: [HACKERS] WIP: [[Parallel] Shared] Hash
On Wed, Jan 11, 2017 at 10:57 AM, Robert Haas wrote: > On Tue, Jan 10, 2017 at 8:56 PM, Peter Geoghegan wrote: >> Instead of all this, I suggest copying some of my changes to fd.c, so >> that resource ownership within fd.c differentiates between a vfd that >> is owned by the backend in the conventional sense, including having a >> need to delete at eoxact, as well as a lesser form of ownership where >> deletion should not happen. > > If multiple processes are using the same file via the BufFile > interface, I think that it is absolutely necessary that there should > be a provision to track the "attach count" of the BufFile. Each > process that reaches EOXact decrements the attach count and when it > reaches 0, the process that reduced it to 0 removes the BufFile. I > think anything that's based on the notion that leaders will remove > files and workers won't is going to be fragile and limiting, and I am > going to push hard against any such proposal. Okay. My BufFile unification approach happens to assume that backends clean up after themselves, but that isn't a rigid assumption (of course, these are always temp files, so we reason about them as temp files). It could be based on a refcount fairly easily, such that, as you say here, deletion of files occurs within workers (that "own" the files) only as a consequence of their being the last backend with a reference, that must therefore "turn out the lights" (delete the file). That seems consistent with what I've done within fd.c, and what I suggested to Thomas (that he more or less follow that approach). You'd probably still want to throw an error when workers ended up not deleting BufFile segments they owned, though, at least for parallel tuplesort. This idea is something that's much more limited than the SharedTemporaryFile() API that you sketched on the parallel sort thread, because it only concerns resource management, and not how to make access to the shared file concurrency safe in any special, standard way. 
I think that this resource management is something that should be managed by buffile.c (and the temp file routines within fd.c that are morally owned by buffile.c, their only caller). It shouldn't be necessary for a client of this new infrastructure, such as parallel tuplesort or parallel hash join, to know anything about file paths. Instead, they should be passing around some kind of minimal private-to-buffile state in shared memory that coordinates backends participating in BufFile unification. Private state created by buffile.c, and passed back to buffile.c. Everything should be encapsulated within buffile.c, IMV, making parallel implementations as close as possible to their serial implementations. -- Peter Geoghegan
Re: [HACKERS] WIP: [[Parallel] Shared] Hash
On Tue, Jan 10, 2017 at 8:56 PM, Peter Geoghegan wrote: > Instead of all this, I suggest copying some of my changes to fd.c, so > that resource ownership within fd.c differentiates between a vfd that > is owned by the backend in the conventional sense, including having a > need to delete at eoxact, as well as a lesser form of ownership where > deletion should not happen. If multiple processes are using the same file via the BufFile interface, I think that it is absolutely necessary that there should be a provision to track the "attach count" of the BufFile. Each process that reaches EOXact decrements the attach count and when it reaches 0, the process that reduced it to 0 removes the BufFile. I think anything that's based on the notion that leaders will remove files and workers won't is going to be fragile and limiting, and I am going to push hard against any such proposal. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Re: [HACKERS] WIP: [[Parallel] Shared] Hash
On Fri, Jan 6, 2017 at 12:01 PM, Thomas Munro wrote: > Here is a new WIP patch. I have plenty of things to tidy up (see note > at end), but the main ideas are now pretty clear and I'd appreciate > some feedback. I have some review feedback for your V3. I've chosen to start with the buffile.c stuff, since of course it might share something with my parallel tuplesort patch. This isn't comprehensive, but I will have more comprehensive feedback soon. I'm not surprised that you've generally chosen to make shared BufFile management as simple as possible, with no special infrastructure other than the ability to hold open other backend temp files concurrently within a worker, and no writing to another worker's temp file, or shared read pointer. As you put it, everything is immutable. I couldn't see much opportunity for adding a lot of infrastructure that wasn't written explicitly as parallel hash join code/infrastructure. My sense is that that was a good decision. I doubted that you'd ever want some advanced, generic shared BufFile thing with multiple read pointers, built-in cache coherency, etc. (Robert seemed to think that you'd go that way, though.) Anyway, some more specific observations: * ISTM that this is the wrong thing for shared BufFiles: > +BufFile * > +BufFileImport(BufFileDescriptor *descriptor) > +{ ... > + file->isInterXact = true; /* prevent cleanup by this backend */ There is only one user of isInterXact = true BufFiles at present, tuplestore.c. It, in turn, only does so for cases that require persistent tuple stores. A quick audit of these tuplestore.c callers shows this to just be cursor support code within portalmem.c. Here is the relevant tuplestore_begin_heap() rule that that code adheres to, unlike the code I've quoted above: * interXact: if true, the files used for on-disk storage persist beyond the * end of the current transaction. 
NOTE: It's the caller's responsibility to * create such a tuplestore in a memory context and resource owner that will * also survive transaction boundaries, and to ensure the tuplestore is closed * when it's no longer wanted. I don't think it's right for buffile.c to know anything about file paths directly -- I'd say that that's a modularity violation. PathNameOpenFile() is called by very few callers at the moment, all of them very low level (e.g. md.c), but you're using it within buffile.c to open a path to the file that you obtain from shared memory directly. This is buggy because the following code won't be reached in workers that call your BufFileImport() function: /* Mark it for deletion at close */ VfdCache[file].fdstate |= FD_TEMPORARY; /* Register it with the current resource owner */ if (!interXact) { VfdCache[file].fdstate |= FD_XACT_TEMPORARY; ResourceOwnerEnlargeFiles(CurrentResourceOwner); ResourceOwnerRememberFile(CurrentResourceOwner, file); VfdCache[file].resowner = CurrentResourceOwner; /* ensure cleanup happens at eoxact */ have_xact_temporary_files = true; } Certainly, you don't want the "Mark it for deletion at close" bit. Deletion should not happen at eoxact for non-owners-but-sharers (within FileClose()), but you *do* want CleanupTempFiles() to call FileClose() for the virtual file descriptors you've opened in the backend, to do some other cleanup. In general, you want to buy into resource ownership for workers. As things stand, I think that this will leak virtual file descriptors. That's really well hidden because there is a similar CleanupTempFiles() call at proc exit, I think. (Didn't take the time to make sure that that's what masked problems. I'm sure that you want minimal divergence with serial cases, resource-ownership-wise, in any case.) 
Instead of all this, I suggest copying some of my changes to fd.c, so that resource ownership within fd.c differentiates between a vfd that is owned by the backend in the conventional sense, including having a need to delete at eoxact, as well as a lesser form of ownership where deletion should not happen. Maybe you'll end up using my BufFileUnify interface [1] within workers (instead of just within the leader, as with parallel tuplesort), and have it handle all of that for you. Currently, that would mean that there'd be an unused/0 sized "local" segment for the unified BufFile, but I was thinking of making that not happen unless and until a new segment is actually needed, so even that minor wart wouldn't necessarily affect you. > Some assorted notes on the status: I need to do some thinking about > the file cleanup logic: both explicit deletes at the earliest possible > time, and failure/error paths. Currently the creator of each file is > responsible for cleaning it up, but I guess if the creator aborts > early the file disappears underneath the others' feet, and then I > guess they might raise a confusing error report that races against the > root cause error report;
Re: [HACKERS] WIP: [[Parallel] Shared] Hash
On Sat, Jan 7, 2017 at 9:01 AM, Thomas Munro wrote: > On Tue, Jan 3, 2017 at 10:53 PM, Thomas Munro > wrote: >> I will post a new rebased version soon with that and >> some other nearby problems fixed. > > Here is a new WIP patch. To make this easier to understand and harmonise the logic used in a few places, I'm now planning to chop it up into a patch series, probably something like this: 1. Change existing hash join code to use chunk-based accounting 2. Change existing hash join code to use a new interface for dealing with batches 3. Add shared hash join support, single batch only 4. Add components for doing shared batch reading (unused) 5. Add multi-batch shared hash join support -- Thomas Munro http://www.enterprisedb.com
Re: [HACKERS] WIP: [[Parallel] Shared] Hash
On Sat, Jan 7, 2017 at 9:01 AM, Thomas Munro wrote: > On Tue, Jan 3, 2017 at 10:53 PM, Thomas Munro > wrote: >> I will post a new rebased version soon with that and >> some other nearby problems fixed. > > Here is a new WIP patch. I forgot to mention: this applies on top of barrier-v5.patch, over here: https://www.postgresql.org/message-id/CAEepm%3D3g3EC734kgriWseiJPfUQZeoMWdhAfzOc0ecewAa5uXg%40mail.gmail.com -- Thomas Munro http://www.enterprisedb.com
Re: [HACKERS] WIP: [[Parallel] Shared] Hash
On Mon, Jan 2, 2017 at 3:17 PM, Peter Geoghegan wrote: > I noticed a bug in your latest revision: > >> + /* >> +* In HJ_NEED_NEW_OUTER, we already selected the current inner batch for >> +* reading from. If there is a shared hash table, we may have already >> +* partially loaded the hash table in ExecHashJoinPreloadNextBatch. >> +*/ >> + Assert(hashtable->batch_reader.batchno = curbatch); >> + Assert(hashtable->batch_reader.inner); > > Obviously this isn't supposed to be an assignment. Right, thanks! I will post a new rebased version soon with that and some other nearby problems fixed. -- Thomas Munro http://www.enterprisedb.com
Re: [HACKERS] WIP: [[Parallel] Shared] Hash
On Sat, Dec 31, 2016 at 2:52 AM, Thomas Munro wrote: > Unfortunately it's been a bit trickier than I anticipated to get the > interprocess batch file sharing and hash table shrinking working > correctly and I don't yet have a new patch in good enough shape to > post in time for the January CF. More soon. I noticed a bug in your latest revision: > + /* > +* In HJ_NEED_NEW_OUTER, we already selected the current inner batch for > +* reading from. If there is a shared hash table, we may have already > +* partially loaded the hash table in ExecHashJoinPreloadNextBatch. > +*/ > + Assert(hashtable->batch_reader.batchno = curbatch); > + Assert(hashtable->batch_reader.inner); Obviously this isn't supposed to be an assignment. -- Peter Geoghegan
Re: [HACKERS] WIP: [[Parallel] Shared] Hash
On Sat, Dec 3, 2016 at 1:38 AM, Haribabu Kommi wrote: > Moved to next CF with "waiting on author" status. Unfortunately it's been a bit trickier than I anticipated to get the interprocess batch file sharing and hash table shrinking working correctly and I don't yet have a new patch in good enough shape to post in time for the January CF. More soon. -- Thomas Munro http://www.enterprisedb.com
Re: [HACKERS] WIP: [[Parallel] Shared] Hash
On Thu, Nov 3, 2016 at 4:19 PM, Thomas Munro wrote: > Obviously I'm actively working on developing and stabilising all this. > Some of the things I'm working on are: work_mem accounting, batch > increases, rescans and figuring out if the resource management for > those BufFiles is going to work. There are quite a lot of edge cases > some of which I'm still figuring out, but I feel like this approach is > workable. At this stage I want to share what I'm doing to see if > others have feedback, ideas, blood curdling screams of horror, etc. I > will have better patches and a set of test queries soon. Thanks for > reading. This patch didn't receive any review, and it is not applying properly to HEAD. Moved to next CF with "waiting on author" status. Regards, Hari Babu Fujitsu Australia
[HACKERS] WIP: [[Parallel] Shared] Hash
Hi hackers, In PostgreSQL 9.6, hash joins can be parallelised under certain conditions, but a copy of the hash table is built in every participating backend. That means that memory and CPU time are wasted. In many cases, that's OK: if the hash table contents are small and cheap to compute, then we don't really care, we're just happy that the probing can be done in parallel. But in cases where the hash table is large and/or expensive to build, we could do much better. I am working on that problem. To recap the situation in 9.6, a hash join can appear below a Gather node and it looks much the same as a non-parallel hash join except that it has a partial outer plan:

  -> Hash Join
       -> [partial outer plan]
       -> Hash
            -> [inner plan]

A partial plan is one that has some kind of 'scatter' operation as its ultimate source of tuples. Currently the only kind of scatter operation is a Parallel Seq Scan (but see also the Parallel Index Scan and Parallel Bitmap Scan proposals). The scatter operation enables parallelism in all the executor nodes above it, as far as the enclosing 'gather' operation which must appear somewhere above it. Currently the only kind of gather operation is a Gather node (but see also the Gather Merge proposal which adds a new one). The inner plan is built from a non-partial parallel-safe path and will be run in every worker. Note that a Hash Join node in 9.6 isn't parallel-aware itself: it's not doing anything special at execution time to support parallelism. The planner has determined that correct partial results will be produced by this plan, but the executor nodes are blissfully unaware of parallelism. PROPOSED NEW PLAN VARIANTS Shortly I will post a patch which introduces two new hash join plan variants that are parallel-aware: 1. Parallel Hash Join with Shared Hash

  -> Parallel Hash Join
       -> [partial outer plan]
       -> Shared Hash
            -> [inner plan]

In this case, there is only one copy of the hash table and only one participant loads it. 
The other participants wait patiently for one chosen backend to finish building the hash table, and then they all wake up and probe. Call the number of participants P, being the number of workers + 1 (for the leader). Compared to a non-shared hash plan, we avoid wasting CPU and IO resources running P copies of the inner plan in parallel (something that is not well captured in our costing model for parallel query today), and we can allow ourselves to use a hash table P times larger while sticking to the same overall space target of work_mem * P. 2. Parallel Hash Join with Parallel Shared Hash

  -> Parallel Hash Join
       -> [partial outer plan]
       -> Parallel Shared Hash
            -> [inner plan]

In this case, the inner plan is run in parallel by all participants. We have the advantages of a shared hash table as described above, and now we can also divide the work of running the inner plan and hashing the resulting tuples by P participants. Note that Parallel Shared Hash is acting as a special kind of gather operation that is the counterpart to the scatter operation contained in the inner plan. PERFORMANCE So far I have been unable to measure any performance degradation compared with unpatched master for hash joins with non-shared hash. That's good because it means that I didn't slow existing plans down when I introduced a bunch of conditional branches to existing hash join code. Laptop testing shows greater than 2x speedups on several of the TPC-H queries with single batches, and no slowdowns. I will post test numbers on big rig hardware in the coming weeks when I have the batching code in more complete and stable shape. IMPLEMENTATION I have taken the approach of extending the existing hash join algorithm, rather than introducing separate hash join executor nodes or a fundamentally different algorithm. Here's a short description of what the patch does: 1. 
SHARED HASH TABLE To share data between participants, the patch uses two other patches I have proposed: DSA areas[1], which provide a higher level interface to DSM segments to make programming with processes a little more like programming with threads, and in particular a per-parallel-query DSA area[2] that is made available for any executor node that needs some shared work space. The patch uses atomic operations to push tuples into the hash table buckets while building, rehashing and loading, and then the hash table is immutable during probing (except for match flags used to implement outer joins). The existing memory chunk design is retained for dense allocation of tuples, which provides a convenient way to rehash the table when its size changes. 2. WORK COORDINATION To coordinate parallel work, this patch uses two other patches: barriers[3], to implement a 'barrier' or 'phaser' synchronisation primitive, and those in turn use the condition variables proposed by Robert Haas. Barriers provide a way for participants to break work up into phases.