Re: [HACKERS] Parallel tuplesort (for parallel B-Tree index creation)
On Fri, Nov 3, 2017 at 2:24 PM, Peter Geoghegan wrote:
> Thomas Munro wrote:
>> That way you don't have to opt in to BufFile's double buffering and
>> segmentation schemes just to get shared file clean-up, if for some
>> reason you want direct file handles.
>
> Is that something that you really think is possible?

It's pretty far-fetched, but maybe shared temporary relation files accessed via smgr.c/md.c? Or maybe future things that don't want to read/write through a buffer but instead want to mmap it.

--
Thomas Munro
http://www.enterprisedb.com

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
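The mmap idea floated above can be illustrated outside PostgreSQL entirely. This is a hedged sketch in plain POSIX C -- none of these functions are fd.c or smgr.c APIs -- showing a temporary file being read back through mmap() instead of buffered read() calls:

```c
/* Hypothetical illustration (plain POSIX, not PostgreSQL's fd.c API):
 * reading a temporary file through mmap() instead of buffered I/O. */
#include <fcntl.h>
#include <stdlib.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

/* Write len bytes to a freshly created temp file and return its fd. */
static int make_temp_file(const char *data, size_t len)
{
    char path[] = "/tmp/mmap_demo_XXXXXX";
    int fd = mkstemp(path);

    if (fd < 0)
        return -1;
    unlink(path);               /* file lives on until fd is closed */
    if (write(fd, data, len) != (ssize_t) len)
    {
        close(fd);
        return -1;
    }
    return fd;
}

/* Map the file and copy its first len bytes into out. */
static int read_via_mmap(int fd, char *out, size_t len)
{
    void *p = mmap(NULL, len, PROT_READ, MAP_SHARED, fd, 0);

    if (p == MAP_FAILED)
        return -1;
    memcpy(out, p, len);
    munmap(p, len);
    return 0;
}
```

The point of the sketch is only that a direct file handle (an fd, or PostgreSQL's File) is all mmap() needs, whereas BufFile's buffered interface would be in the way.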
Re: [HACKERS] Parallel tuplesort (for parallel B-Tree index creation)
Thomas Munro wrote:
>> I'm going to make an item on my personal TODO list for that. No useful
>> insights on that right now, though.
>
> I decided to try that, but it didn't really work: fd.h gets included by
> front-end code, so I can't very well define a struct and declare
> functions that deal in dsm_segment and slock_t. On the other hand it
> does seem a bit better for these shared file sets to work in terms of
> File, not BufFile.

Realistically, fd.h has a number of functions that are really owned by buffile.c already. This sounds fine.

> That way you don't have to opt in to BufFile's double buffering and
> segmentation schemes just to get shared file clean-up, if for some
> reason you want direct file handles.

Is that something that you really think is possible?

--
Peter Geoghegan
Re: [HACKERS] Parallel tuplesort (for parallel B-Tree index creation)
On Wed, Nov 1, 2017 at 2:11 PM, Peter Geoghegan wrote:
> On Tue, Oct 31, 2017 at 5:07 PM, Thomas Munro wrote:
>> Another complaint is that perhaps fd.c knows too much about buffile.c's
>> business. For example, RemovePgTempFilesInDir() knows about the ".set"
>> directories created by buffile.c, which might be called a layering
>> violation. Perhaps the set/directory logic should move entirely into
>> fd.c, so you'd call FileSetInit(FileSet *), not
>> BufFileSetInit(BufFileSet *), and then BufFileOpenShared() would take
>> a FileSet *, not a BufFileSet *. Thoughts?
>
> I'm going to make an item on my personal TODO list for that. No useful
> insights on that right now, though.

I decided to try that, but it didn't really work: fd.h gets included by front-end code, so I can't very well define a struct and declare functions that deal in dsm_segment and slock_t. On the other hand it does seem a bit better for these shared file sets to work in terms of File, not BufFile. That way you don't have to opt in to BufFile's double buffering and segmentation schemes just to get shared file clean-up, if for some reason you want direct file handles.

So in the v24 parallel hash patch set I just posted over in the other thread I have moved it into its own translation unit sharedfileset.c and made it work with File objects. buffile.c knows how to use it as a source of segment files. I think that's better.

> If the new standard is that you have temp file names that suggest the
> purpose of each temp file, then that may be something that parallel
> CREATE INDEX should buy into.

Yeah, I guess that could be useful.

--
Thomas Munro
http://www.enterprisedb.com
Re: [HACKERS] Parallel tuplesort (for parallel B-Tree index creation)
On Tue, Oct 31, 2017 at 5:07 PM, Thomas Munro wrote: > So that's this bit: > > + pg_itoa(worker, filename); > + lts->pfile = BufFileCreateShared(fileset, filename); > > ... and: > > + pg_itoa(i, filename); > + file = BufFileOpenShared(fileset, filename); Right. > What's wrong with using a worker number like this? I guess nothing, though there is the question of discoverability for DBAs, etc. You do address this separately, by having (potentially) descriptive filenames, as you go into. > It's not random choice: buffile.c creates a uniquely named directory > (or directories, if you have more than one location configured in the > temp_tablespaces GUC) to hold all the backing files involved in each > BufFileSet. Naming of BufFiles within the BufFileSet is the caller's > problem, and a worker number seems like a reasonable choice to me. It > won't collide with a concurrent parallel CREATE INDEX because that'll > be using its own BufFileSet. Oh, I see. I may have jumped the gun on that one. > One complaint about the current coding that someone might object to: > MakeSharedSegmentPath() just dumps the caller's BufFile name into a > path without sanitisation: I should fix that so that we only accept > fairly limited strings here. Another complaint is that perhaps fd.c > knows too much about buffile.c's business. For example, > RemovePgTempFilesInDir() knows about the ".set" directories created by > buffile.c, which might be called a layering violation. Perhaps the > set/directory logic should move entirely into fd.c, so you'd call > FileSetInit(FileSet *), not BufFileSetInit(BufFileSet *), and then > BufFileOpenShared() would take a FileSet *, not a BufFileSet *. > Thoughts? I'm going to make an item on my personal TODO list for that. No useful insights on that right now, though. > 3. sharedtuplestore.c takes a caller-supplied BufFileSet and creates > its shared BufFiles in there. 
Earlier versions created and owned a > BufFileSet, but in the current Parallel Hash patch I create loads of > separate SharedTuplestore objects but I didn't want to create load of > directories to back them, so you can give them all the same > BufFileSet. That works because SharedTuplestores are also given a > name, and it's the caller's job (in my case nodeHash.c) to make sure > the SharedTuplestores are given unique names within the same > BufFileSet. For Parallel Hash you'll see names like 'i3of8' (inner > batch 3 of 8). There is no need for there to be in any sort of > central registry for that though, because it rides on top of the > guarantees from 2 above: buffile.c will put those files into a > uniquely named directory, and that works as long as no one else is > allowed to create files or directories in the temp directory that > collide with its reserved pattern /^pgsql_tmp.+\.set$/. For the same > reason, parallel CREATE INDEX is free to use worker numbers as BufFile > names, since it has its own BufFileSet to work within. If the new standard is that you have temp file names that suggest the purpose of each temp file, then that may be something that parallel CREATE INDEX should buy into. > In an earlier version, BufFileSet was one of those annoying data > structures with a FLEXIBLE_ARRAY_MEMBER that you'd use as an > incomplete type (declared but not defined in the includable header), > and here it was being used "inside" (or rather after) SharedSort, > which *itself* had a FLEXIBLE_ARRAY_MEMBER. The reason for the > variable sized object was that I needed all backends to agree on the > set of temporary tablespace OIDs, of which there could be any number, > but I also needed a 'flat' (pointer-free) object I could stick in > relocatable shared memory. In the newest version I changed that > flexible array to tablespaces[8], because 8 should be enough > tablespaces for anyone (TM). 
I guess that that's something that you'll need to take up with Andres, if you haven't already. I have a hard time imagining a single query needing to use more than that many tablespaces at once, so maybe this is fine.

> I don't really believe anyone uses temp_tablespaces for IO load
> balancing anymore and I hate code like the above. So I think Rushabh
> should now remove the above-quoted code and just use a BufFileSet
> directly as a member of SharedSort.

FWIW, I agree with you that nobody uses temp_tablespaces this way these days. This seems like a discussion for your hash join patch, though. I'm happy to buy into that.

--
Peter Geoghegan
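The flexible-array-versus-fixed-array trade-off discussed above can be sketched concretely. All struct and field names below are invented stand-ins for illustration, not the actual PostgreSQL definitions:

```c
/* Sketch of the trade-off described above; names are hypothetical. */
#include <stddef.h>

typedef unsigned int Oid;

/* Variable-length version: a flexible array member must be the LAST
 * member of a struct, so this cannot be embedded inside a struct that
 * already ends with its own flexible array (like SharedSort). */
typedef struct VarTablespaceSet
{
    int     ntablespaces;
    Oid     tablespaces[];      /* FLEXIBLE_ARRAY_MEMBER */
} VarTablespaceSet;

/* Fixed-size version: a flat, pointer-free object whose size is known
 * at compile time ("8 should be enough tablespaces for anyone"). */
#define MAX_TEMP_TABLESPACES 8

typedef struct FlatTablespaceSet
{
    int     ntablespaces;
    Oid     tablespaces[MAX_TEMP_TABLESPACES];
} FlatTablespaceSet;

/* Because the flat version has a fixed size, another shared-memory
 * struct can embed it as an ordinary member and still keep its own
 * flexible array at the end. */
typedef struct SharedSortSketch
{
    FlatTablespaceSet fileset;  /* embedded directly, no pointers */
    int     nTapes;
    long    tapes[];            /* the sort's own flexible array */
} SharedSortSketch;
```

The embedded version is what "use a BufFileSet directly as a member of SharedSort" would look like in shape: relocatable, pointer-free, and sized at compile time.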
Re: [HACKERS] Parallel tuplesort (for parallel B-Tree index creation)
On Wed, Nov 1, 2017 at 11:29 AM, Peter Geoghegan wrote:
> On Thu, Oct 26, 2017 at 4:22 AM, Rushabh Lathia wrote:
>> Attaching the rebased patch according to the v22 parallel-hash patch sets
>
> I took a quick look at this today, and noticed a few issues:
>
> * make_name() is used to name files in sharedtuplestore.c, which is
> what is passed to BufFileOpenShared() for parallel hash join. You're
> using your own logic for that within the equivalent logtape.c call to
> BufFileOpenShared(), presumably because make_name() wants to identify
> participants by PID rather than by an ordinal identifier number.

So that's this bit:

+ pg_itoa(worker, filename);
+ lts->pfile = BufFileCreateShared(fileset, filename);

... and:

+ pg_itoa(i, filename);
+ file = BufFileOpenShared(fileset, filename);

What's wrong with using a worker number like this?

> I think that we need some kind of central registry for things that use
> shared buffiles. It could be that sharedtuplestore.c is further
> generalized to support this, or it could be that they both call
> something else that takes care of naming. It's not okay to have this
> left to random chance.

It's not random chance: buffile.c creates a uniquely named directory (or directories, if you have more than one location configured in the temp_tablespaces GUC) to hold all the backing files involved in each BufFileSet. Naming of BufFiles within the BufFileSet is the caller's problem, and a worker number seems like a reasonable choice to me. It won't collide with a concurrent parallel CREATE INDEX because that'll be using its own BufFileSet.

> You're going to have to ask Thomas about this. You should also use
> MAXPGPATH for the char buffer on the stack.

Here's a summary of the namespace management scheme I currently have at the three layers fd.c, buffile.c, sharedtuplestore.c:

1. fd.c provides new lower-level functions PathNameCreateTemporaryFile(const char *path) and PathNameOpenTemporaryFile(const char *path).
It also provides PathNameCreateTemporaryDir(). Clearly callers of these interfaces will need to be very careful about managing the names they use. Callers also own the problem of cleaning up files, since there is no automatic cleanup of files created this way. My intention was that these facilities would *only* be used by BufFileSet, since it has machinery to manage those things. 2. buffile.c introduces BufFileSet, which is conceptually a set of BufFiles that can be shared by multiple backends with DSM segment-scoped cleanup. It is implemented as a set of directories: one for each tablespace in temp_tablespaces. It controls the naming of those directories. The BufFileSet directories are named similarly to fd.c's traditional temporary file names using the usual recipe "pgsql_tmp" + PID + per-process counter but have an additional ".set" suffix. RemovePgTempFilesInDir() recognises directories with that prefix and suffix as junk left over from a crash when cleaning up. I suppose it's that knowledge about reserved name patterns and cleanup that you are thinking of as a central registry? As for the BufFiles that are in a BufFileSet, buffile.c has no opinion on that: the calling code (parallel CREATE INDEX, sharedtuplestore.c, ...) is responsible for coming up with its own scheme. If parallel CREATE INDEX wants to name shared BufFiles "walrus" and "banana", that's OK by me, and those files won't collide with anything in another BufFileSet because each BufFileSet has its own directory (-ies). One complaint about the current coding that someone might object to: MakeSharedSegmentPath() just dumps the caller's BufFile name into a path without sanitisation: I should fix that so that we only accept fairly limited strings here. Another complaint is that perhaps fd.c knows too much about buffile.c's business. For example, RemovePgTempFilesInDir() knows about the ".set" directories created by buffile.c, which might be called a layering violation. 
Perhaps the set/directory logic should move entirely into fd.c, so you'd call FileSetInit(FileSet *), not BufFileSetInit(BufFileSet *), and then BufFileOpenShared() would take a FileSet *, not a BufFileSet *. Thoughts?

3. sharedtuplestore.c takes a caller-supplied BufFileSet and creates its shared BufFiles in there. Earlier versions created and owned a BufFileSet, but in the current Parallel Hash patch I create loads of separate SharedTuplestore objects but I didn't want to create loads of directories to back them, so you can give them all the same BufFileSet. That works because SharedTuplestores are also given a name, and it's the caller's job (in my case nodeHash.c) to make sure the SharedTuplestores are given unique names within the same BufFileSet. For Parallel Hash you'll see names like 'i3of8' (inner batch 3 of 8). There is no need for any sort of central registry for that though, because it rides on top of the guarantees from 2 above: buffile.c will put those files into a uniquely named directory, and that works as long as no one else is allowed to create files or directories in the temp directory that collide with its reserved pattern /^pgsql_tmp.+\.set$/. For the same reason, parallel CREATE INDEX is free to use worker numbers as BufFile names, since it has its own BufFileSet to work within.
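The directory-naming and crash-cleanup recipe described in this message can be sketched as follows. The function names are illustrative stand-ins, not the actual fd.c/buffile.c symbols; only the "pgsql_tmp" + PID + counter + ".set" recipe and the /^pgsql_tmp.+\.set$/ junk-detection pattern come from the text above:

```c
/* Sketch of the set-directory naming recipe and the reserved-pattern
 * check a crash-cleanup routine could apply. Names are hypothetical. */
#include <stdio.h>
#include <string.h>

/* Build a set-directory name like "pgsql_tmp12345.7.set":
 * the usual temp-name recipe plus a ".set" suffix. */
static void make_fileset_dir_name(char *buf, size_t buflen, int pid, int counter)
{
    snprintf(buf, buflen, "pgsql_tmp%d.%d.set", pid, counter);
}

/* Match /^pgsql_tmp.+\.set$/ -- how a routine in the spirit of
 * RemovePgTempFilesInDir() could recognise leftover set directories. */
static int looks_like_fileset_dir(const char *name)
{
    const char *prefix = "pgsql_tmp";
    const char *suffix = ".set";
    size_t plen = strlen(prefix);
    size_t slen = strlen(suffix);
    size_t nlen = strlen(name);

    if (nlen <= plen + slen)    /* the ".+" needs at least one char */
        return 0;
    return strncmp(name, prefix, plen) == 0 &&
           strcmp(name + nlen - slen, suffix) == 0;
}
```

Anything a caller names inside such a directory ("walrus", "banana", "i3of8", a worker number) is collision-free by construction, which is the layering point being made above.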
Re: [HACKERS] Parallel tuplesort (for parallel B-Tree index creation)
On Thu, Oct 26, 2017 at 4:22 AM, Rushabh Lathia wrote:
> Attaching the rebased patch according to the v22 parallel-hash patch sets

I took a quick look at this today, and noticed a few issues:

* make_name() is used to name files in sharedtuplestore.c, which is what is passed to BufFileOpenShared() for parallel hash join. You're using your own logic for that within the equivalent logtape.c call to BufFileOpenShared(), presumably because make_name() wants to identify participants by PID rather than by an ordinal identifier number.

I think that we need some kind of central registry for things that use shared buffiles. It could be that sharedtuplestore.c is further generalized to support this, or it could be that they both call something else that takes care of naming. It's not okay to have this left to random chance.

You're going to have to ask Thomas about this. You should also use MAXPGPATH for the char buffer on the stack.

* This logtape.c comment needs to be updated, as it's no longer true:

 * successfully. In general, workers can take it that the leader will
 * reclaim space in files under their ownership, and so should not
 * reread from tape.

* Robert hated the comment changes in the header of nbtsort.c. You might want to change it back, because he is likely to be the one that commits this.

* You should look for similar comments in tuplesort.c (IIRC a couple of places will need to be revised).

* tuplesort_begin_common() should actively reject a randomAccess parallel case using elog(ERROR).

* tuplesort.h should note that randomAccess isn't supported, too.

* What's this all about?:

+ /* Accessor for the SharedBufFileSet that is at the end of Sharedsort. */
+ #define GetSharedBufFileSet(shared)\
+ ((BufFileSet *) (&(shared)->tapes[(shared)->nTapes]))

You can't just cast from one type to the other without regard for the underlying size of the shared memory buffer, which is what this looks like to me.
This only fails to crash because you're only abusing the last member in the tapes array for this purpose, and there happens to be enough shared memory slop that you get away with it. I'm pretty sure that ltsUnify() ends up clobbering the last/leader tape, which is a place where BufFileSet is also used, so this is just wrong. You should rethink the shmem structure a little bit.

* There is still that FIXME comment within leader_takeover_tapes(). I believe that you should still have a leader tape (at least in local memory in the leader), even though you'll never be able to do anything with it, since randomAccess is no longer supported. You can remove the FIXME, and just note that you have a leader tape to be consistent with the serial case, though recognize that it's not useful. Note that even with randomAccess, we always had the leader tape, so it's not that different, really.

I suppose it might make sense to make shared->tapes not have a leader tape. It hardly matters -- perhaps you should leave it there in order to keep the code simple, as you'll be keeping the leader tape in local memory, too. (But it still won't fly to continue to clobber it, of course -- you still need to find a dedicated place for BufFileSet in shared memory.)

That's all I have right now.

--
Peter Geoghegan
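One possible shape for the "dedicated place in shared memory" fix suggested above, sketched with invented stand-in types: size the segment explicitly and compute an aligned offset past the tapes array, rather than casting past tapes[nTapes] and relying on slop. This is a hypothetical sketch, not the patch's actual structs:

```c
/* Sketch: give the trailing fileset blob dedicated, aligned space
 * after the flexible tape array. All names here are stand-ins. */
#include <stddef.h>
#include <stdlib.h>

typedef struct TapeShared { long firstblock; long nbytes; } TapeShared;
typedef struct FileSetBlob { int refcount; char dirname[32]; } FileSetBlob;

typedef struct SharedSortSketch
{
    int         nTapes;
    TapeShared  tapes[];        /* flexible array: nTapes entries */
} SharedSortSketch;

#define ALIGN_UP(x, a) (((x) + ((a) - 1)) & ~((size_t) (a) - 1))

/* Total bytes to request from shared memory: header + tape array,
 * rounded up for alignment, plus the trailing blob. */
static size_t shared_sort_size(int nTapes)
{
    size_t sz = offsetof(SharedSortSketch, tapes) +
                (size_t) nTapes * sizeof(TapeShared);

    sz = ALIGN_UP(sz, sizeof(long));
    return sz + sizeof(FileSetBlob);
}

/* Accessor that points at the dedicated space, never at a tape. */
static FileSetBlob *get_fileset(SharedSortSketch *shared)
{
    size_t off = offsetof(SharedSortSketch, tapes) +
                 (size_t) shared->nTapes * sizeof(TapeShared);

    off = ALIGN_UP(off, sizeof(long));
    return (FileSetBlob *) ((char *) shared + off);
}
```

The key difference from the quoted macro is that the allocation size accounts for both the tape array and the blob, so writing one can never clobber the other.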
Re: [HACKERS] Parallel tuplesort (for parallel B-Tree index creation)
On Wed, Sep 20, 2017 at 2:32 AM, Rushabh Lathia wrote:
> Yes, I haven't touched the randomAccess part yet. My initial goal was
> to incorporate the BufFileSet APIs here.

This is going to need a rebase, due to the commit today to remove replacement selection sort. That much should be easy.

> Sorry, I didn't get this part. Are you talking about your patch's
> changes to OpenTemporaryFileInTablespace(), BufFileUnify() and other
> changes related to ltsUnify()? If that's the case, I don't think it is
> required with the BufFileSet. Correct me if I am wrong here.

I thought that you'd have multiple BufFiles, which would be multiplexed (much like a single BufFile itself multiplexes 1GB segments), so that logtape.c could still recycle space in the randomAccess case. I guess that that's not a goal now.

> To be frank its too early for me to comment anything in this area. I
> need to study this more closely. As an initial goal I was just focused
> on understanding the current implementation of the patch and
> incorporating the BufFileSet APIs.

Fair enough.

--
Peter Geoghegan
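The parenthetical analogy above -- a single BufFile multiplexing 1GB segment files -- comes down to simple offset arithmetic. A minimal sketch (the 1GB constant matches PostgreSQL's default segment size, but treat the functions as illustrative, not buffile.c's API):

```c
/* Sketch of BufFile-style segment multiplexing: one logical byte
 * offset maps onto (segment file, offset within segment). */

#define SEGMENT_SIZE (1024LL * 1024LL * 1024LL)   /* 1GB per segment */

/* Which physical segment file holds this logical offset? */
static long long segment_of(long long offset)
{
    return offset / SEGMENT_SIZE;
}

/* Byte position within that segment file. */
static long long offset_in_segment(long long offset)
{
    return offset % SEGMENT_SIZE;
}
```

A seek to logical offset N therefore opens (or reuses) segment file N / 1GB and positions at N % 1GB; the same arithmetic, one level up, is what multiplexing several workers' BufFiles behind one logtape.c interface would amount to.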
Re: [HACKERS] Parallel tuplesort (for parallel B-Tree index creation)
On Wed, Sep 20, 2017 at 5:32 AM, Rushabh Lathia wrote:
> First application for the tuplesort here is CREATE INDEX and that
> doesn't need randomAccess. But as you said and as has been discussed in
> the thread, randomAccess is important and we should surely put in the
> effort to support it.

There's no direct benefit of working on randomAccess support unless we have some code that wants to use that support for something. Indeed, it would just leave us with code we couldn't test. While I do agree that there are probably use cases for randomAccess, I think what we should do right now is try to get this patch reviewed and committed so that we have parallel CREATE INDEX for btree indexes. And in so doing, let's keep it as simple as possible. Parallel CREATE INDEX for btree indexes is a great feature without adding any more complexity.

Later, anybody who wants to work on randomAccess support -- and whatever planner and executor changes are needed to make effective use of it -- can do so. For example, one can imagine a plan like this:

Gather
-> Merge Join
   -> Parallel Index Scan
   -> Parallel Sort
      -> Parallel Seq Scan

If the parallel sort reads out all of the output in every worker, then it becomes legal to do this kind of thing -- it would end up, I think, being quite similar to Parallel Hash. However, there's some question in my mind as to whether we want to do this or, say, hash-partition both relations and then perform separate joins on each partition. The above plan is clearly better than what we can do today, where every worker would have to repeat the sort, ugh, but I don't know if it's the best plan. Fortunately, to get this patch committed, we don't have to figure that out.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Re: [HACKERS] Parallel tuplesort (for parallel B-Tree index creation)
On Wed, Sep 20, 2017 at 5:17 AM, Peter Geoghegan wrote: > On Tue, Sep 19, 2017 at 3:21 AM, Rushabh Lathia > wrote: > > As per the earlier discussion in the thread, I did experiment using > > BufFileSet interface from parallel-hash-v18.patchset. I took the > reference > > of parallel-hash other patches to understand the BufFileSet APIs, and > > incorporate the changes to parallel create index. > > > > In order to achieve the same: > > > > - Applied 0007-Remove-BufFile-s-isTemp-flag.patch and > > 0008-Add-BufFileSet-for-sharing-temporary-files-between-b.patch from the > > parallel-hash-v18.patchset. > > - Removed the buffile.c/logtap.c/fd.c changes from the parallel CREATE > > INDEX v10 patch. > > - incorporate the BufFileSet API to the parallel tuple sort for CREATE > > INDEX. > > - Changes into few existing functions as well as added few to support the > > BufFileSet changes. > > I'm glad that somebody is working on this. (Someone closer to the more > general work on shared/parallel BufFile infrastructure than I am.) > > I do have some quick feedback, and I hope to be able to provide that > to both you and Thomas, as needed to see this one through. I'm not > going to get into the tricky details around resource management just > yet. I'll start with some simpler questions, to get a general sense of > the plan here. > > Thanks Peter. > I gather that you're at least aware that your v11 of the patch doesn't > preserve randomAccess support for parallel sorts, because you didn't > include my 0002-* testing GUCs patch, which was specifically designed > to make various randomAccess stuff testable. I also figured this to be > true because I noticed this FIXME among (otherwise unchanged) > tuplesort code: > > Yes, I haven't touched the randomAccess part yet. My initial goal was to incorporate the BufFileSet api's here. 
>> +static void
>> +leader_takeover_tapes(Tuplesortstate *state)
>> +{
>> + Sharedsort *shared = state->shared;
>> + int nLaunched = state->nLaunched;
>> + int j;
>> +
>> + Assert(LEADER(state));
>> + Assert(nLaunched >= 1);
>> + Assert(nLaunched == shared->workersFinished);
>> +
>> + /*
>> +* Create the tapeset from worker tapes, including a leader-owned tape at
>> +* the end. Parallel workers are far more expensive than logical tapes,
>> +* so the number of tapes allocated here should never be excessive. FIXME
>> +*/
>> + inittapestate(state, nLaunched + 1);
>> + state->tapeset = LogicalTapeSetCreate(nLaunched + 1, shared->tapes,
>> + state->fileset, state->worker);

> It's not surprising to me that you do not yet have this part working,
> because much of my design was about changing as little as possible
> above the BufFile interface, in order for tuplesort.c (and logtape.c)
> code like this to "just work" as if it was the serial case.

Right. I just followed your design in your earlier patches.

> It doesn't look like you've added the kind of BufFile multiplexing code
> that I expected to see in logtape.c. This is needed to compensate for
> the code removed from fd.c and buffile.c. Perhaps it would help me to
> go look at Thomas' latest parallel hash join patch -- did it gain some
> kind of transparent multiplexing ability that you actually (want to)
> use here?

Sorry, I didn't get this part. Are you talking about your patch's changes to OpenTemporaryFileInTablespace(), BufFileUnify() and other changes related to ltsUnify()? If that's the case, I don't think it is required with the BufFileSet. Correct me if I am wrong here.

> Though randomAccess isn't used by CREATE INDEX in general, and so not
> supporting randomAccess within tuplesort.c for parallel callers
> doesn't matter as far as this CREATE INDEX user-visible feature is
> concerned, I still believe that randomAccess is important (IIRC,
> Robert thought so too).
Specifically, it seems like a good idea to > have randomAccess support, both on general principle (why should the > parallel case be different?), and because having it now will probably > enable future enhancements to logtape.c. Enhancements that have it > manage parallel sorts based on partitioning/distribution/bucketing > [1]. I'm pretty sure that partitioning-based parallel sort is going to > become very important in the future, especially for parallel > GroupAggregate. The leader needs to truly own the tapes it reclaims > from workers in order for all of this to work. > > First application for the tuplesort here is CREATE INDEX and that doesn't need randomAccess. But as you said and in the thread its been discussed, randomAccess is an important and we should sure put an efforts to support the same. Questions on where you're going with randomAccess support: > > 1. Is randomAccess support a goal for you here at all? > > 2. If so, is preserving eager recycling of temp file space during > randomAccess (materializing a final output tape within the
Re: [HACKERS] Parallel tuplesort (for parallel B-Tree index creation)
On Tue, Sep 19, 2017 at 3:21 AM, Rushabh Lathia wrote: > As per the earlier discussion in the thread, I did experiment using > BufFileSet interface from parallel-hash-v18.patchset. I took the reference > of parallel-hash other patches to understand the BufFileSet APIs, and > incorporate the changes to parallel create index. > > In order to achieve the same: > > - Applied 0007-Remove-BufFile-s-isTemp-flag.patch and > 0008-Add-BufFileSet-for-sharing-temporary-files-between-b.patch from the > parallel-hash-v18.patchset. > - Removed the buffile.c/logtap.c/fd.c changes from the parallel CREATE > INDEX v10 patch. > - incorporate the BufFileSet API to the parallel tuple sort for CREATE > INDEX. > - Changes into few existing functions as well as added few to support the > BufFileSet changes. I'm glad that somebody is working on this. (Someone closer to the more general work on shared/parallel BufFile infrastructure than I am.) I do have some quick feedback, and I hope to be able to provide that to both you and Thomas, as needed to see this one through. I'm not going to get into the tricky details around resource management just yet. I'll start with some simpler questions, to get a general sense of the plan here. I gather that you're at least aware that your v11 of the patch doesn't preserve randomAccess support for parallel sorts, because you didn't include my 0002-* testing GUCs patch, which was specifically designed to make various randomAccess stuff testable. I also figured this to be true because I noticed this FIXME among (otherwise unchanged) tuplesort code: > +static void > +leader_takeover_tapes(Tuplesortstate *state) > +{ > + Sharedsort *shared = state->shared; > + int nLaunched = state->nLaunched; > + int j; > + > + Assert(LEADER(state)); > + Assert(nLaunched >= 1); > + Assert(nLaunched == shared->workersFinished); > + > + /* > +* Create the tapeset from worker tapes, including a leader-owned tape at > +* the end. 
Parallel workers are far more expensive than logical tapes, > +* so the number of tapes allocated here should never be excessive. FIXME > +*/ > + inittapestate(state, nLaunched + 1); > + state->tapeset = LogicalTapeSetCreate(nLaunched + 1, shared->tapes, > + state->fileset, state->worker); It's not surprising to me that you do not yet have this part working, because much of my design was about changing as little as possible above the BufFile interface, in order for tuplesort.c (and logtape.c) code like this to "just work" as if it was the serial case. It doesn't look like you've added the kind of BufFile multiplexing code that I expected to see in logtape.c. This is needed to compensate for the code removed from fd.c and buffile.c. Perhaps it would help me to go look at Thomas' latest parallel hash join patch -- did it gain some kind of transparent multiplexing ability that you actually (want to) use here? Though randomAccess isn't used by CREATE INDEX in general, and so not supporting randomAccess within tuplesort.c for parallel callers doesn't matter as far as this CREATE INDEX user-visible feature is concerned, I still believe that randomAccess is important (IIRC, Robert thought so too). Specifically, it seems like a good idea to have randomAccess support, both on general principle (why should the parallel case be different?), and because having it now will probably enable future enhancements to logtape.c. Enhancements that have it manage parallel sorts based on partitioning/distribution/bucketing [1]. I'm pretty sure that partitioning-based parallel sort is going to become very important in the future, especially for parallel GroupAggregate. The leader needs to truly own the tapes it reclaims from workers in order for all of this to work. Questions on where you're going with randomAccess support: 1. Is randomAccess support a goal for you here at all? 2. 
If so, is preserving eager recycling of temp file space during randomAccess (materializing a final output tape within the leader) another goal for you here? Do we need to preserve that property of serial external sorts, too, so that it remains true that logtape.c ensures that "the total space usage is essentially just the actual data volume, plus insignificant bookkeeping and start/stop overhead"? (I'm quoting from master's logtape.c header comments.) 3. Any ideas on next steps in support of those 2 goals? What problems do you foresee, if any? > CREATE INDEX serial_idx ON parallel_sort_test (randint); > > Without patch: > > Time: 3430054.220 ms (57:10.054) > > With patch (max_parallel_workers_maintenance = 8): > > Time: 1163445.271 ms (19:23.445) This looks very similar to my v10. While I will need to follow up on this, to make sure, it seems likely that this patch has exactly the same performance characteristics as v10. Thanks [1] https://wiki.postgresql.org/wiki/Parallel_External_Sort#Partitioning_for_parallelism_.28parallel_externa
Re: [HACKERS] Parallel tuplesort (for parallel B-Tree index creation)
On 2017-02-10 07:52:57 -0500, Robert Haas wrote:
> On Thu, Feb 9, 2017 at 6:38 PM, Thomas Munro wrote:
>> Up until two minutes ago I assumed that policy would leave only two
>> possibilities: you attach to the DSM segment and attach to the
>> SharedBufFileManager successfully or you attach to the DSM segment and
>> then die horribly (but not throw) and the postmaster restarts the
>> whole cluster and blows all temp files away with RemovePgTempFiles().
>> But I see now in the comment of that function that crash-induced
>> restarts don't call that because "someone might want to examine the
>> temp files for debugging purposes". Given that policy for regular
>> private BufFiles, I don't see why that shouldn't apply equally to
>> shared files: after a crash restart, you may have some junk files that
>> won't be cleaned up until your next clean restart, whether they were
>> private or shared BufFiles.
>
> I think most people (other than Tom) would agree that that policy
> isn't really sensible any more; it probably made sense when the
> PostgreSQL user community was much smaller and consisted mostly of the
> people developing PostgreSQL, but these days it's much more likely to
> cause operational headaches than to help a developer debug.

FWIW, we have restart_after_crash = false. If you need to debug things, you can enable that. Hence the whole RemovePgTempFiles() crash-restart exemption isn't required anymore; we have a much more targeted solution.

- Andres
Re: [HACKERS] Parallel tuplesort (for parallel B-Tree index creation)
On Wed, Mar 22, 2017 at 5:44 AM, Robert Haas wrote:
>> Actually, that's quite possible with the design I came up with.
>
> I don't think it is. What sequence of calls to the APIs you've
> proposed would accomplish that goal? I don't see anything in this
> patch set that would permit anything other than a handoff from the
> worker to the leader. There seems to be no way for the ref count to
> be more than 1 (or 2?).

See my remarks on this below.

>> The restriction that Thomas can't live with as I've left things is
>> that you have to know the number of BufFiles ahead of time. I'm
>> pretty sure that that's all it is. (I do sympathize with the fact
>> that that isn't very helpful to him, though.)
>
> I feel like there's some cognitive dissonance here. On the one hand,
> you're saying we should use your design.

No, I'm not. I'm saying that my design is complete on its own terms, and has some important properties that a mechanism like this ought to have. I think I've been pretty clear on my general uncertainty about the broader question.

> On the other hand, you are admitting that in at least one key respect,
> it won't meet Thomas's requirements. On the third hand, you just said
> that you weren't arguing for two mechanisms for sharing a BufFile
> across cooperating parallel processes. I don't see how you can hold
> all three of those positions simultaneously.

I respect your position as the person that completely owns parallelism here. You are correct when you say that there has to be some overlap between the requirements for the mechanisms used by each patch -- there just *has* to be. As I said, I only know very approximately how much overlap that is or should be, even at this late date, and I am unfortunately not in a position to spend more time on it to find out. C'est la vie.

I know that I have no chance of convincing you to adopt my design here, and you are right not to accept the design, because there is a bigger picture.
And, because it's just too late now. My efforts to get ahead of that, and anticipate and provide for Thomas' requirements have failed. I admit that. But, you are asserting that my patch has specific technical defects that it does not have. I structured things this way for a reason. You are not required to agree with me in full to see that I might have had a point. I've described it as a trade-off already. I think that it will be of practical value to you to see that trade-off. This insight is what allowed me to immediately zero in on resource leak bugs in Thomas' revision of the patch from yesterday. >> It is true that a worker resowner can unlink() the files >> mid-unification, in the same manner as with conventional temp files, >> and not decrement its refcount in shared memory, or care at all in any >> special way. This is okay because the leader (in the case of parallel >> tuplesort) will realize that it should not "turn out the lights", >> finding that remaining reference when it calls BufFileClose() in >> registered callback, as it alone must. It doesn't matter that the >> unlink() may have already occurred, or may be just about to occur, >> because we are only operating on already-opened files, and never on >> the link itself (we don't have to stat() the file link for example, >> which is naturally only a task for the unlink()'ing backend anyway). >> You might say that the worker only blows away the link itself, not the >> file proper, since it may still be open in leader (say). > > Well, that sounds like it's counting on fd.c not to close the file > descriptor at an inconvenient point in time and reopen it later, which > is not guaranteed. It's true that in an error path, if the FD of the file we just opened gets swapped out, that could happen. That seems virtually impossible, and in any case the consequence is no worse than a confusing LOG message. But, yes, that's a weakness. 
>> This is the only way to avoid the unlink()-ENOENT-ignore hack, AFAICT, >> since only the worker itself can reliably know how many segments it >> has opened at every single instant in time. Because it's the owner! > > Above, you said that your design would allow for a group of processes > to share access to a file, with the last one that abandons it "turning > out the lights". But here, you are referring to it as having one > owner - the "only the worker itself" can know the number of segments. > Those things are exact opposites of each other. You misunderstood. Under your analogy, the worker needs to wait for someone else to enter the room before leaving, because otherwise, as an "environmentally conscious" worker, it would be compelled to turn the lights out before anyone else ever got to do anything with its files. But once someone else is in the room, the worker is free to leave without turning out the lights. I could provide a mechanism for the leader, or whatever the other backend is, to do another hand off. You're right that that is left unimplemented, but it would be a
Re: [HACKERS] Parallel tuplesort (for parallel B-Tree index creation)
On Tue, Mar 21, 2017 at 7:37 PM, Peter Geoghegan wrote: >>> Isn't that an essential part of having a refcount, in general? You >>> were the one that suggested refcounting. >> >> No, quite the opposite. My point in suggesting adding a refcount was >> to avoid needing to have a single owner. Instead, the process that >> decrements the reference count to zero becomes responsible for doing >> the cleanup. What you've done with the ref count is use it as some >> kind of medium for transferring responsibility from backend A to >> backend B; what I want is to allow backends A, B, C, D, E, and F to >> attach to the same shared resource, and whichever one of them happens >> to be the last one out of the room shuts off the lights. > > Actually, that's quite possible with the design I came up with. I don't think it is. What sequence of calls to the APIs you've proposed would accomplish that goal? I don't see anything in this patch set that would permit anything other than a handoff from the worker to the leader. There seems to be no way for the ref count to be more than 1 (or 2?). > The > restriction that Thomas can't live with as I've left things is that > you have to know the number of BufFiles ahead of time. I'm pretty sure > that that's all it is. (I do sympathize with the fact that that isn't > very helpful to him, though.) I feel like there's some cognitive dissonance here. On the one hand, you're saying we should use your design. On the other hand, you are admitting that in at least one key respect, it won't meet Thomas's requirements. On the third hand, you just said that you weren't arguing for two mechanisms for sharing a BufFile across cooperating parallel processes. I don't see how you can hold all three of those positions simultaneously. >> As I've said before, I think that's an anti-goal. This is a different >> problem, and trying to reuse the solution we chose for the >> non-parallel case doesn't really work. 
resowner.c could end up owning >> a shared reference count which it's responsible for decrementing -- >> and then decrementing it removes the file if the result is zero. But >> it can't own performing the actual unlink(), because then we can't >> support cases where the file may have multiple readers, since whoever >> owns the unlink() might try to zap the file out from under one of the >> others. > > Define "zap the file". I think, based on your remarks here, that > you've misunderstood my design. I think you should at least understand > it fully if you're going to dismiss it. zap was a colloquialism for unlink(). I concede that I don't fully understand your design, and am trying to understand those things I do not yet understand. > It is true that a worker resowner can unlink() the files > mid-unification, in the same manner as with conventional temp files, > and not decrement its refcount in shared memory, or care at all in any > special way. This is okay because the leader (in the case of parallel > tuplesort) will realize that it should not "turn out the lights", > finding that remaining reference when it calls BufFileClose() in > registered callback, as it alone must. It doesn't matter that the > unlink() may have already occurred, or may be just about to occur, > because we are only operating on already-opened files, and never on > the link itself (we don't have to stat() the file link for example, > which is naturally only a task for the unlink()'ing backend anyway). > You might say that the worker only blows away the link itself, not the > file proper, since it may still be open in leader (say). Well, that sounds like it's counting on fd.c not to close the file descriptor at an inconvenient point in time and reopen it later, which is not guaranteed. > Thomas' design cannot reliably know how many segments there are in > workers in error paths, which necessitates his unlink()-ENOENT-ignore > hack. 
My solution is that workers/owners look after their own temp > segments in the conventional way, until they reach BufFileClose(), > which may never come if there is an error. The only way that clean-up > won't happen in conventional resowner.c-in-worker fashion is if > BufFileClose() is reached in owner/worker. BufFileClose() must be > reached when there is no error, which has to happen anyway when using > temp files. (Else there is a temp file leak warning from resowner.c.) > > This is the only way to avoid the unlink()-ENOENT-ignore hack, AFAICT, > since only the worker itself can reliably know how many segments it > has opened at every single instant in time. Because it's the owner! Above, you said that your design would allow for a group of processes to share access to a file, with the last one that abandons it "turning out the lights". But here, you are referring to it as having one owner - the "only the worker itself" can know the number of segments. Those things are exact opposites of each other. I don't think there's any problem with ignoring ENOENT, and I don't t
Re: [HACKERS] Parallel tuplesort (for parallel B-Tree index creation)
On Tue, Mar 21, 2017 at 2:49 PM, Thomas Munro wrote: > I'm going to experiment with refactoring the v10 parallel CREATE INDEX > patch to use the SharedBufFileSet interface from > hj-shared-buf-file-v8.patch today and see what problems I run into. I would be happy if you took over parallel CREATE INDEX completely. It makes a certain amount of sense, and not just because I am no longer able to work on it. You're the one doing things with shared BufFiles that are of significant complexity. Certainly more complicated than what parallel CREATE INDEX needs in every way, and necessarily so. I will still have some more feedback on your shared BufFile design, though, while it's fresh in my mind. -- Peter Geoghegan -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Parallel tuplesort (for parallel B-Tree index creation)
On Tue, Mar 21, 2017 at 2:03 PM, Robert Haas wrote: > I agree that the extent to which code reuse is possible here is > somewhat unclear, but I am 100% confident that the answer is non-zero. > You and Thomas both need BufFiles that can be shared across multiple > backends associated with the same ParallelContext. I don't understand > how you can argue that it's reasonable to have two different ways of > sharing the same kind of object across the same set of processes. I didn't argue that. Rather, I argued that there are going to be significant additional requirements for PHJ, because it has to support arbitrarily many BufFiles, rather than either 1 or 2 (one per tuplesort/logtapeset). Just how "significant" that would be I cannot say, regrettably. (Or, we're going to have to make logtape.c multiplex BufFiles, which risks breaking other logtape.c routines that aren't even used just yet.) >> Isn't that an essential part of having a refcount, in general? You >> were the one that suggested refcounting. > > No, quite the opposite. My point in suggesting adding a refcount was > to avoid needing to have a single owner. Instead, the process that > decrements the reference count to zero becomes responsible for doing > the cleanup. What you've done with the ref count is use it as some > kind of medium for transferring responsibility from backend A to > backend B; what I want is to allow backends A, B, C, D, E, and F to > attach to the same shared resource, and whichever one of them happens > to be the last one out of the room shuts off the lights. Actually, that's quite possible with the design I came up with. The restriction that Thomas can't live with as I've left things is that you have to know the number of BufFiles ahead of time. I'm pretty sure that that's all it is. (I do sympathize with the fact that that isn't very helpful to him, though.) > As I've said before, I think that's an anti-goal. 
This is a different > problem, and trying to reuse the solution we chose for the > non-parallel case doesn't really work. resowner.c could end up owning > a shared reference count which it's responsible for decrementing -- > and then decrementing it removes the file if the result is zero. But > it can't own performing the actual unlink(), because then we can't > support cases where the file may have multiple readers, since whoever > owns the unlink() might try to zap the file out from under one of the > others. Define "zap the file". I think, based on your remarks here, that you've misunderstood my design. I think you should at least understand it fully if you're going to dismiss it. It is true that a worker resowner can unlink() the files mid-unification, in the same manner as with conventional temp files, and not decrement its refcount in shared memory, or care at all in any special way. This is okay because the leader (in the case of parallel tuplesort) will realize that it should not "turn out the lights", finding that remaining reference when it calls BufFileClose() in registered callback, as it alone must. It doesn't matter that the unlink() may have already occurred, or may be just about to occur, because we are only operating on already-opened files, and never on the link itself (we don't have to stat() the file link for example, which is naturally only a task for the unlink()'ing backend anyway). You might say that the worker only blows away the link itself, not the file proper, since it may still be open in leader (say). ** We rely on the fact that files are themselves a kind of reference counted thing, in general; they have an independent existence from the link originally used to open() them. 
** The reason that there is a brief wait in workers for parallel tuplesort is because it gives us the opportunity to have the immediately subsequent worker BufFileClose() not turn out the lights in worker, because leader must have a reference on the BufFile when workers are released. So, there is a kind of interlock that makes sure that there is always at least 1 owner. ** There would be no need for an additional wait but for the fact the leader wants to unify multiple worker BufFiles as one, and must open them all at once for the sake of simplicity. But that's just how parallel tuplesort in particular happens to work, since it has only one BufFile in the leader, which it wants to operate on with everything set up up-front. ** Thomas' design cannot reliably know how many segments there are in workers in error paths, which necessitates his unlink()-ENOENT-ignore hack. My solution is that workers/owners look after their own temp segments in the conventional way, until they reach BufFileClose(), which may never come if there is an error. The only way that clean-up won't happen in conventional resowner.c-in-worker fashion is if BufFileClose() is reached in owner/worker. BufFileClose() must be reached when there is no error, which has to happen anyway when using temp files. (Else there is a temp file leak warnin
Re: [HACKERS] Parallel tuplesort (for parallel B-Tree index creation)
On Wed, Mar 22, 2017 at 10:03 AM, Robert Haas wrote: > On Tue, Mar 21, 2017 at 3:50 PM, Peter Geoghegan wrote: >> I disagree with that. It is a >> trade-off, I suppose. I have now run out of time to work through it >> with you or Thomas, though. > > Bummer. I'm going to experiment with refactoring the v10 parallel CREATE INDEX patch to use the SharedBufFileSet interface from hj-shared-buf-file-v8.patch today and see what problems I run into. -- Thomas Munro http://www.enterprisedb.com
Re: [HACKERS] Parallel tuplesort (for parallel B-Tree index creation)
On Tue, Mar 21, 2017 at 3:50 PM, Peter Geoghegan wrote: > On Tue, Mar 21, 2017 at 12:06 PM, Robert Haas wrote: >> From my point of view, the main point is that having two completely >> separate mechanisms for managing temporary files that need to be >> shared across cooperating workers is not a good decision. That's a >> need that's going to come up over and over again, and it's not >> reasonable for everybody who needs it to add a separate mechanism for >> doing it. We need to have ONE mechanism for it. > > Obviously I understand that there is value in code reuse in general. > The exact extent to which code reuse is possible here has been unclear > throughout, because it's complicated for all kinds of reasons. That's > why Thomas and I had 2 multi-hour Skype calls all about it. I agree that the extent to which code reuse is possible here is somewhat unclear, but I am 100% confident that the answer is non-zero. You and Thomas both need BufFiles that can be shared across multiple backends associated with the same ParallelContext. I don't understand how you can argue that it's reasonable to have two different ways of sharing the same kind of object across the same set of processes. And if that's not reasonable, then somehow we need to come up with a single mechanism that can meet both your requirements and Thomas's requirements. >> It's just not OK in my book for a worker to create something that it >> initially owns and then later transfer it to the leader. > > Isn't that an essential part of having a refcount, in general? You > were the one that suggested refcounting. No, quite the opposite. My point in suggesting adding a refcount was to avoid needing to have a single owner. Instead, the process that decrements the reference count to zero becomes responsible for doing the cleanup. 
What you've done with the ref count is use it as some kind of medium for transferring responsibility from backend A to backend B; what I want is to allow backends A, B, C, D, E, and F to attach to the same shared resource, and whichever one of them happens to be the last one out of the room shuts off the lights. >> The cooperating backends should have joint ownership of the objects from >> the beginning, and the last process to exit the set should clean up >> those resources. > > That seems like a facile summary of the situation. There is a sense in > which there is always joint ownership of files with my design. But > there is also a sense in which there isn't, because it's impossible to > do that while not completely reinventing resource management of temp > files. I wanted to preserve resowner.c ownership of fd.c segments. As I've said before, I think that's an anti-goal. This is a different problem, and trying to reuse the solution we chose for the non-parallel case doesn't really work. resowner.c could end up owning a shared reference count which it's responsible for decrementing -- and then decrementing it removes the file if the result is zero. But it can't own performing the actual unlink(), because then we can't support cases where the file may have multiple readers, since whoever owns the unlink() might try to zap the file out from under one of the others. > You maintain that it's better to have the leader unlink() everything > at the end, and suppress the errors when that doesn't work, so that > that path always just plows through. I don't want the leader to be responsible for anything. I want the last process to detach to be responsible for cleanup, regardless of which process that ends up being. I want that for lots of good reasons which I have articulated including (1) it's how all other resource management for parallel query already works, e.g. 
DSM, DSA, and group locking; (2) it avoids the need for one process to sit and wait until another process assumes ownership, which isn't a feature even if (as you contend, and I'm not convinced) it doesn't hurt much; and (3) it allows for use cases where multiple processes are reading from the same shared BufFile without the risk that some other process will try to unlink() the file while it's still in use. The point for me isn't so much whether unlink() ever ignores errors as whether cleanup (however defined) is an operation guaranteed to happen exactly once. > I disagree with that. It is a > trade-off, I suppose. I have now run out of time to work through it > with you or Thomas, though. Bummer. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Re: [HACKERS] Parallel tuplesort (for parallel B-Tree index creation)
On Tue, Mar 21, 2017 at 12:06 PM, Robert Haas wrote: > From my point of view, the main point is that having two completely > separate mechanisms for managing temporary files that need to be > shared across cooperating workers is not a good decision. That's a > need that's going to come up over and over again, and it's not > reasonable for everybody who needs it to add a separate mechanism for > doing it. We need to have ONE mechanism for it. Obviously I understand that there is value in code reuse in general. The exact extent to which code reuse is possible here has been unclear throughout, because it's complicated for all kinds of reasons. That's why Thomas and I had 2 multi-hour Skype calls all about it. > It's just not OK in my book for a worker to create something that it > initially owns and then later transfer it to the leader. Isn't that an essential part of having a refcount, in general? You were the one that suggested refcounting. > The cooperating backends should have joint ownership of the objects from > the beginning, and the last process to exit the set should clean up > those resources. That seems like a facile summary of the situation. There is a sense in which there is always joint ownership of files with my design. But there is also a sense in which there isn't, because it's impossible to do that while not completely reinventing resource management of temp files. I wanted to preserve resowner.c ownership of fd.c segments. You maintain that it's better to have the leader unlink() everything at the end, and suppress the errors when that doesn't work, so that that path always just plows through. I disagree with that. It is a trade-off, I suppose. I have now run out of time to work through it with you or Thomas, though. 
> But even if it were true that the waits will always be brief, I still > think the way you've done it is a bad idea, because now tuplesort.c > has to know that it needs to wait because of some detail of > lower-level resource management about which it should not have to > care. That alone is a sufficient reason to want a better approach. There is already a point at which the leader needs to wait, so that it can accumulate stats that nbtsort.c cares about. So we already need a leader wait point within nbtsort.c (that one is called directly by nbtsort.c). Doesn't seem like too bad of a wart to have the same thing for workers. >> I believe that the main reason that you like the design I came up with >> on the whole is that it's minimally divergent from the serial case. > > That's part of it, I guess, but it's more that the code you've added > to do parallelism here looks an awful lot like what's gotten added to > do parallelism in other cases, like parallel query. That's probably a > good sign. It's also a good sign that it makes CREATE INDEX approximately 3 times faster. -- Peter Geoghegan
Re: [HACKERS] Parallel tuplesort (for parallel B-Tree index creation)
On Tue, Mar 21, 2017 at 2:03 PM, Peter Geoghegan wrote: > I think that since the comment refers to code from before 1999, it can > go. Any separate patch to remove it would have an entirely negative > linediff. It's a good general principle that a patch should do one thing well and not make unrelated changes. I try hard to adhere to that principle in my commits, and I think other committers generally do (and should), too. Of course, different people draw the line in different places. If you can convince another committer to include that change in their commit of this patch, well, that's not my cup of tea, but so be it. If you want me to consider committing this, you're going to have to submit that part separately, preferably on a separate thread with a suitably descriptive subject line. > Obviously iff you write to the file in the leader, there is little > that the worker can do afterwards, but it's not a given that you'd > want to do that, and this patch actually never does. You could equally > well say that PHJ fails to provide for my requirement for having the > leader write to the files sensibly in order to recycle blocks, a > requirement that its shared BufFile mechanism expressly does not > support. From my point of view, the main point is that having two completely separate mechanisms for managing temporary files that need to be shared across cooperating workers is not a good decision. That's a need that's going to come up over and over again, and it's not reasonable for everybody who needs it to add a separate mechanism for doing it. We need to have ONE mechanism for it. The second point is that I'm pretty convinced that the design you've chosen is fundamentally wrong. I've attempted to explain that multiple times, starting about three months ago with http://postgr.es/m/ca+tgmoyp0vzpw64dfmqt1jhy6szyavjoglkj3ermzzzn2f9...@mail.gmail.com and continuing across many subsequent emails on multiple threads. 
It's just not OK in my book for a worker to create something that it initially owns and then later transfer it to the leader. The cooperating backends should have joint ownership of the objects from the beginning, and the last process to exit the set should clean up those resources. >> That would cut hundreds of >> lines from this patch with no real disadvantage that I can see -- >> including things like worker_wait(), which are only needed because of >> the shortcomings of the underlying mechanism. > I think it would definitely be a significant net gain in LOC. And, > worker_wait() will probably be replaced by the use of the barrier > abstraction anyway. No, because if you do it Thomas's way, the worker can exit right away, without waiting. You don't have to wait via a different method; you escape waiting altogether. I understand that your point is that the wait will always be brief, but I think that's probably an optimistic assumption and definitely an unnecessary assumption. It's optimistic because there is absolutely no guarantee that all workers will take the same amount of time to sort the data they read. It is absolutely not the case that all data sets sort at the same speed. Because of the way parallel sequential scan works, we're somewhat insulated from that; workers that sort faster will get a larger chunk of the table. However, that only means that workers will finish generating their sorted runs at about the same time, not that they will finish merging at the same time. And, indeed, if some workers end up with more data than others (so that they finish building runs at about the same time) then some will probably take longer to complete the merging than others. But even if it were true that the waits will always be brief, I still think the way you've done it is a bad idea, because now tuplesort.c has to know that it needs to wait because of some detail of lower-level resource management about which it should not have to care. 
That alone is a sufficient reason to want a better approach. I completely accept that whatever abstraction we use at the BufFile level has to be something that can be plumbed into logtape.c, and if Thomas's mechanism can't be bolted in there in a sensible way then that's a problem. But I feel quite strongly that the solution to that problem isn't to adopt the approach you've taken here. >> + * run. Parallel workers always use quicksort, however. >> >> Comment fails to mention a reason. > > Well, I don't think that there is any reason to use replacement > selection at all, what with the additional merge heap work last year. > But, the theory there remains that RS is good when you can get one big > run and no merge. You're not going to get that with parallel sort in > any case, since the leader must merge. Besides, merging in the workers > happens in the workers. And, the backspace requirement of 32MB of > workMem per participant pretty much eliminates any use of RS that > you'd get otherwise. So, please mention that briefly in the comment. > I
Re: [HACKERS] Parallel tuplesort (for parallel B-Tree index creation)
On Tue, Mar 21, 2017 at 9:10 AM, Robert Haas wrote: > - * This code is moderately slow (~10% slower) compared to the regular > - * btree (insertion) build code on sorted or well-clustered data. On > - * random data, however, the insertion build code is unusable -- the > - * difference on a 60MB heap is a factor of 15 because the random > - * probes into the btree thrash the buffer pool. (NOTE: the above > - * "10%" estimate is probably obsolete, since it refers to an old and > - * not very good external sort implementation that used to exist in > - * this module. tuplesort.c is almost certainly faster.) > > While I agree that the old comment is probably inaccurate, I don't > think dropping it without comment in a patch to implement parallel > sorting is the way to go. How about updating it to be more current as > a separate patch? I think that since the comment refers to code from before 1999, it can go. Any separate patch to remove it would have an entirely negative linediff. > +/* Magic numbers for parallel state sharing */ > 1, 2, and 3 would probably work just as well. Okay. > Why is this being put in planner.c rather than something specific to > creating indexes? Not sure that's a good idea. The idea is that it's the planner's domain, but this is a utility statement, so it makes sense to put it next to the CLUSTER function that determines whether CLUSTER sorts rather than does an index scan. I don't have strong feelings on how appropriate that is. > + * This should be called when workers have flushed out temp file buffers and > + * yielded control to caller's process. Workers should hold open their > + * BufFiles at least until the caller's process is able to call here and > + * assume ownership of BufFile. 
The general pattern is that workers make > + * available data from their temp files to one nominated process; there is > + * no support for workers that want to read back data from their original > + * BufFiles following writes performed by the caller, or any other > + * synchronization beyond what is implied by caller contract. All > + * communication occurs in one direction. All output is made available to > + * caller's process exactly once by workers, following call made here at the > + * tail end of processing. > > Thomas has designed a system for sharing files among cooperating > processes that lacks several of these restrictions. With his system, > it's still necessary for all data to be written and flushed by the > writer before anybody tries to read it. But the restriction that the > worker has to hold its BufFile open until the leader can assume > ownership goes away. That's a good thing; it avoids the need for > workers to sit around waiting for the leader to assume ownership of a > resource instead of going away faster and freeing up worker slots for > some other query, or moving on to some other computation. The > restriction that the worker can't reread the data after handing off > the file also goes away. There is no restriction about workers not being able to reread data. That comment makes it clear that that's only when the leader writes to the file. It alludes to rereading within a worker following the leader writing to their files in order to recycle blocks within logtape.c, which the patch never has to do, unless you enable one of the 0002-* testing GUCs to force randomAccess. Obviously iff you write to the file in the leader, there is little that the worker can do afterwards, but it's not a given that you'd want to do that, and this patch actually never does. 
You could equally well say that PHJ fails to provide for my requirement for having the leader write to the files sensibly in order to recycle blocks, a requirement that its shared BufFile mechanism expressly does not support. > That would cut hundreds of > lines from this patch with no real disadvantage that I can see -- > including things like worker_wait(), which are only needed because of > the shortcomings of the underlying mechanism. I think it would definitely be a significant net gain in LOC. And, worker_wait() will probably be replaced by the use of the barrier abstraction anyway. It didn't seem worth creating a dependency on it so early, given my simple requirements. PHJ uses barriers instead, presumably because there is much more of this stuff. The workers generally won't have to wait at all. It's expected to be pretty much instantaneous. > + * run. Parallel workers always use quicksort, however. > > Comment fails to mention a reason. Well, I don't think that there is any reason to use replacement selection at all, what with the additional merge heap work last year. But, the theory there remains that RS is good when you can get one big run and no merge. You're not going to get that with parallel sort in any case, since the leader must merge. Besides, merging in the workers happens in the workers. And, the backspace requirement of 32MB of workMem per participant pretty much eliminates any use o
Re: [HACKERS] Parallel tuplesort (for parallel B-Tree index creation)
On Sun, Mar 19, 2017 at 9:03 PM, Peter Geoghegan wrote:
> On Sun, Mar 12, 2017 at 3:05 PM, Peter Geoghegan wrote:
>> I attach my V9 of the patch. I came up with some stuff for the design of
>> resource management that I think meets every design goal that we have
>> for shared/unified BufFiles:
>
> Commit 2609e91fc broke the parallel CREATE INDEX cost model. I should
> now pass -1 as the index block argument to compute_parallel_worker(),
> just as all callers that aren't parallel index scan do after that
> commit. This issue caused V9 to never choose parallel CREATE INDEX
> within nbtsort.c. There was also a small amount of bitrot.
>
> Attached V10 fixes this regression. I also couldn't resist adding a
> few new assertions that I thought were worth having to buffile.c, plus
> dedicated wait events for parallel tuplesort. And, I fixed a silly bug
> added in V9 around where worker_wait() should occur.

Some initial review comments:

- * This code is moderately slow (~10% slower) compared to the regular
- * btree (insertion) build code on sorted or well-clustered data. On
- * random data, however, the insertion build code is unusable -- the
- * difference on a 60MB heap is a factor of 15 because the random
- * probes into the btree thrash the buffer pool. (NOTE: the above
- * "10%" estimate is probably obsolete, since it refers to an old and
- * not very good external sort implementation that used to exist in
- * this module. tuplesort.c is almost certainly faster.)

While I agree that the old comment is probably inaccurate, I don't think dropping it without comment in a patch to implement parallel sorting is the way to go. How about updating it to be more current as a separate patch?

+/* Magic numbers for parallel state sharing */
+#define PARALLEL_KEY_BTREE_SHARED      UINT64CONST(0xA001)
+#define PARALLEL_KEY_TUPLESORT         UINT64CONST(0xA002)
+#define PARALLEL_KEY_TUPLESORT_SPOOL2  UINT64CONST(0xA003)

1, 2, and 3 would probably work just as well.
The parallel infrastructure uses high-numbered values to avoid conflict with plan_node_id values, but this is a utility statement so there's no such problem. But it doesn't matter very much.

+ * Note: caller had better already hold some type of lock on the table and
+ * index.
+ */
+int
+plan_create_index_workers(Oid tableOid, Oid indexOid)

Caller should pass down the Relation rather than the Oid. That is better both because it avoids unnecessary work and because it more or less automatically avoids the problem mentioned in the note. Why is this being put in planner.c rather than something specific to creating indexes? Not sure that's a good idea.

+ * This should be called when workers have flushed out temp file buffers and
+ * yielded control to caller's process. Workers should hold open their
+ * BufFiles at least until the caller's process is able to call here and
+ * assume ownership of BufFile. The general pattern is that workers make
+ * available data from their temp files to one nominated process; there is
+ * no support for workers that want to read back data from their original
+ * BufFiles following writes performed by the caller, or any other
+ * synchronization beyond what is implied by caller contract. All
+ * communication occurs in one direction. All output is made available to
+ * caller's process exactly once by workers, following call made here at the
+ * tail end of processing.

Thomas has designed a system for sharing files among cooperating processes that lacks several of these restrictions. With his system, it's still necessary for all data to be written and flushed by the writer before anybody tries to read it. But the restriction that the worker has to hold its BufFile open until the leader can assume ownership goes away.
That's a good thing; it avoids the need for workers to sit around waiting for the leader to assume ownership of a resource instead of going away faster and freeing up worker slots for some other query, or moving on to some other computation. The restriction that the worker can't reread the data after handing off the file also goes away. The files can be read and written by any participant in any order, as many times as you like, with only the restriction that the caller must guarantee that data will be written and flushed from private buffers before it can be read. I don't see any reason to commit both his system and your system, and his is more general, so I think you should use it. That would cut hundreds of lines from this patch with no real disadvantage that I can see -- including things like worker_wait(), which are only needed because of the shortcomings of the underlying mechanism.

+ * run. Parallel workers always use quicksort, however.

Comment fails to mention a reason.

+elog(LOG, "%d using " INT64_FORMAT " KB of memory for read buffers among %d input tapes",
+     state->worker, state->availMem / 1024, numInputTapes);
Re: [HACKERS] Parallel tuplesort (for parallel B-Tree index creation)
On Sun, Mar 12, 2017 at 3:05 PM, Peter Geoghegan wrote:
> I attach my V9 of the patch. I came up with some stuff for the design of
> resource management that I think meets every design goal that we have
> for shared/unified BufFiles:

Commit 2609e91fc broke the parallel CREATE INDEX cost model. I should now pass -1 as the index block argument to compute_parallel_worker(), just as all callers that aren't parallel index scan do after that commit. This issue caused V9 to never choose parallel CREATE INDEX within nbtsort.c. There was also a small amount of bitrot.

Attached V10 fixes this regression. I also couldn't resist adding a few new assertions that I thought were worth having to buffile.c, plus dedicated wait events for parallel tuplesort. And, I fixed a silly bug added in V9 around where worker_wait() should occur.

--
Peter Geoghegan

0001-Add-parallel-B-tree-index-build-sorting.patch.gz
Description: GNU Zip compressed data

0002-Add-temporary-testing-tools.patch.gz
Description: GNU Zip compressed data

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Parallel tuplesort (for parallel B-Tree index creation)
On Sun, Mar 12, 2017 at 3:05 PM, Peter Geoghegan wrote:
> There is still an open item here, though: The leader-as-worker
> Tuplesortstate, a special case, can still leak files.

I phrased this badly. What I mean is that there can be instances where temp files are left on disk following a failure such as palloc() OOM; no backend ends up doing an unlink() iff a leader-as-worker Tuplesortstate was used and we get unlucky. I did not mean a leak of virtual or real file descriptors, which would see Postgres print a refcount leak warning from resowner.c. Naturally, these "leaked" files will eventually be deleted by the next restart of the server at the latest, within RemovePgTempFiles(). Note also that a duplicate unlink() (with annoying LOG message) is impossible under any circumstances with V9, regardless of whether or not a leader-as-worker Tuplesortstate is involved.

Anyway, I was sure that I needed to completely nail this down in order to be consistent with existing guarantees, but another look at OpenTemporaryFile() makes me doubt that. ResourceOwnerEnlargeFiles() is called, which itself uses palloc(), which can of course fail. There are remarks over that function within resowner.c about OOM:

/*
 * Make sure there is room for at least one more entry in a ResourceOwner's
 * files reference array.
 *
 * This is separate from actually inserting an entry because if we run out
 * of memory, it's critical to do so *before* acquiring the resource.
 */
void
ResourceOwnerEnlargeFiles(ResourceOwner owner)
{
    ...
}

But this happens after OpenTemporaryFileInTablespace() has already returned. Taking care to allocate memory up-front here is motivated by keeping the vFD cache entry and current resource owner in perfect agreement about the FD_XACT_TEMPORARY-ness of a file, and that's it. It's *not* true that there is a broader sense in which OpenTemporaryFile() is atomic, which for some reason I previously believed to be the case.
So, I haven't failed to prevent an outcome that wasn't already possible. It doesn't seem like it would be that hard to fix this, and then have the parallel tuplesort patch live up to that new higher standard. But, it's possible that Tom or maybe someone else would consider that a bad idea, for roughly the same reason that we don't call RemovePgTempFiles() for *crash* induced restarts, as mentioned by Thomas up-thread:

 * NOTE: we could, but don't, call this during a post-backend-crash restart
 * cycle. The argument for not doing it is that someone might want to examine
 * the temp files for debugging purposes. This does however mean that
 * OpenTemporaryFile had better allow for collision with an existing temp
 * file name.
 */
void
RemovePgTempFiles(void)
{
    ...
}

Note that I did put some thought into making sure OpenTemporaryFile() does the right thing with collisions with existing temp files. So, maybe the right thing is to do nothing at all. I don't have strong feelings either way on this question.

--
Peter Geoghegan
Re: [HACKERS] Parallel tuplesort (for parallel B-Tree index creation)
On Thu, Feb 16, 2017 at 8:45 AM, Peter Geoghegan wrote:
>> I do not think there should be any reason why we can't get the
>> resource accounting exactly correct here. If a single backend manages
>> to remove every temporary file that it creates exactly once (and
>> that's currently true, modulo system crashes), a group of cooperating
>> backends ought to be able to manage to remove every temporary file
>> that any of them create exactly once (again, modulo system crashes).
>
> I believe that we are fully in agreement here. In particular, I think
> it's bad that there is an API that says "caller shouldn't throw an
> elog error between these two points", and that will be fixed before
> too long. I just think that it's worth acknowledging a certain nuance.

I attach my V9 of the patch. I came up with some stuff for the design of resource management that I think meets every design goal that we have for shared/unified BufFiles:

* Avoids both resource leaks, and spurious double-freeing of resources (e.g., a second unlink() for a file from a different process) when there are errors. The latter problem was possible before, a known issue with V8 of the patch. I believe that this revision avoids these problems in a way that is *absolutely bulletproof* in the face of arbitrary failures (e.g., palloc() failure) in any process at any time. Although, be warned that there is a remaining open item concerning resource management in the leader-as-worker case, which I go into below.

There are now what you might call "critical sections" in one function. That is, there are points where we cannot throw an error (without a BEGIN_CRIT_SECTION()!), but those are entirely confined to unification code within the leader, where we can be completely sure that no error can be raised. The leader can even fail before some but not all of a particular worker's segments are in its local resource manager, and we still do the right thing.
I've been testing this by adding code that randomly throws errors at points interspersed throughout worker and leader unification hand-off points. I then leave this stress-test build to run for a few hours, while monitoring for leaked files and spurious fd.c reports of double-unlink() and similar issues. Test builds change LOG to PANIC within several places in fd.c, while MAX_PHYSICAL_FILESIZE was reduced from 1GiB to BLCKSZ.

All of these guarantees are made without any special care from caller to buffile.c. The only V9 change to tuplesort.c or logtape.c in this general area is that they have to pass a dynamic shared memory segment to buffile.c, so that it can register a new callback. That's it. This may be of particular interest to Thomas. All complexity is confined to buffile.c.

* No expansion in the use of shared memory to manage resources. BufFile refcount is still per-worker. The role of local resource managers is unchanged.

* Additional complexity over and above ordinary BufFile resource management is confined to the leader process and its on_dsm_detach() callback. Only the leader registers a callback. Of course, refcount management within BufFileClose() can still take place in workers, but that isn't something that we rely on (that's only for non-error paths). In general, worker processes mostly have resource managers managing their temp file segments as a thing that has nothing to do with BufFiles (BufFiles are still not owned by resowner.c/fd.c -- they're blissfully unaware of all of this stuff).

* In general, unified BufFiles can still be treated in exactly the same way as conventional BufFiles, and things just work, without any special cases being exercised internally.

There is still an open item here, though: The leader-as-worker Tuplesortstate, a special case, can still leak files. So, stress-testing will only show the patch to be completely robust against resource leaks when nbtsort.c is modified to enable FORCE_SINGLE_WORKER testing.
Despite the name FORCE_SINGLE_WORKER, you can also modify that file to force there to be arbitrary-many workers requested (just change "requested = 1" to something else). The leader-as-worker problem is avoided because we don't have the leader participating as a worker this way, which would otherwise present issues for resowner.c that I haven't got around to fixing just yet. It isn't hard to imagine why this is -- one backend with two FDs for certain fd.c temp segments is just going to cause problems for resowner.c without additional special care. Didn't seem worth blocking on that. I want to prove that my general approach is workable. That problem is confined to one backend's resource manager when it is the leader participating as a worker. It is not a refcount problem. The simplest solution here would be to ban the leader-as-worker case by contract. Alternatively, we could pass fd.c segments from the leader-as-worker Tuplesortstate's BufFile to the leader Tuplesortstate's BufFile without opening or closing anything. This way, there will be no second vFD entry
Re: [HACKERS] Parallel tuplesort (for parallel B-Tree index creation)
On Wed, Mar 1, 2017 at 10:29 PM, Thomas Munro wrote:
> I'm testing a patch that lets you set up a fixed sized
> SharedBufFileSet object in a DSM segment, with its own refcount for
> the reason you explained. It supports a dynamically expandable set of
> numbered files, so each participant gets to export file 0, file 1,
> file 2 and so on as required, in any order. I think this should suit
> both Parallel Tuplesort which needs to export just one file from each
> participant, and Parallel Shared Hash which doesn't know up front how
> many batches it will produce. Not quite ready but I will post a
> version tomorrow to get Peter's reaction.

See 0007-hj-shared-buf-file-v6.patch in the v6 tarball in the parallel shared hash thread.

--
Thomas Munro
http://www.enterprisedb.com
Re: [HACKERS] Parallel tuplesort (for parallel B-Tree index creation)
On Sat, Feb 11, 2017 at 1:52 AM, Robert Haas wrote:
> On Thu, Feb 9, 2017 at 6:38 PM, Thomas Munro wrote:
>> Yes, potentially unbounded in rare case. If we plan for N batches,
>> and then run out of work_mem because our estimates were just wrong or
>> the distributions of keys is sufficiently skewed, we'll run
>> HashIncreaseNumBatches, and that could happen more than once. I have
>> a suite of contrived test queries that hits all the various modes and
>> code paths of hash join, and it includes a query that plans for one
>> batch but finishes up creating many, and then the leader exits. I'll
>> post that to the other thread along with my latest patch series soon.
>
> Hmm, OK. So that's going to probably require something where a fixed
> amount of DSM can describe an arbitrary number of temp file series.
> But that also means this is an even-more-special-purpose tool that
> shouldn't be deeply tied into parallel.c so that it can run before any
> errors happen.
>
> Basically, I think the "let's write the code between here and here so
> it throws no errors" technique is, for 99% of PostgreSQL programming,
> difficult and fragile. We shouldn't rely on it if there is some other
> reasonable option.

I'm testing a patch that lets you set up a fixed sized SharedBufFileSet object in a DSM segment, with its own refcount for the reason you explained. It supports a dynamically expandable set of numbered files, so each participant gets to export file 0, file 1, file 2 and so on as required, in any order. I think this should suit both Parallel Tuplesort which needs to export just one file from each participant, and Parallel Shared Hash which doesn't know up front how many batches it will produce. Not quite ready but I will post a version tomorrow to get Peter's reaction.

--
Thomas Munro
http://www.enterprisedb.com
Re: [HACKERS] Parallel tuplesort (for parallel B-Tree index creation)
On Thu, Feb 16, 2017 at 11:45 AM, Peter Geoghegan wrote:
> Maybe I'm just being pedantic here, since we both actually want the
> code to do the same thing.

Pedantry from either of us? Nah...

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Re: [HACKERS] Parallel tuplesort (for parallel B-Tree index creation)
On Wed, Feb 15, 2017 at 6:05 PM, Thomas Munro wrote:
> Here are some results with your latest patch, using the same test as
> before but this time with SCALE=100 (= 100,000,000 rows).

Cool.

> To reduce the number of combinations I did only unique data and built
> only non-unique indexes with only 'wide' tuples (= key plus a text
> column that holds a 151-character wide string, rather than just the
> key), and also didn't bother with the 1MB memory size as suggested.
> Here are the results up to 4 workers (a results table going up to 8
> workers is attached, since it wouldn't format nicely if I pasted it
> here).

I think that you are still I/O bound in a way that is addressable by adding more disks. The exception is the text cases, where the patch does best. (I don't place too much emphasis on that because I know that in the long term, we'll have abbreviated keys, which will take some of the sheen off of that.)

> Again, the w = 0 time is seconds, the rest show relative
> speed-up.

I think it's worth pointing out that while there are cases where we see no benefit from going from 4 to 8 workers, it tends to hardly hurt at all, or hardly help at all. It's almost irrelevant that the number of workers used is excessive, at least up until the point when all cores have their own worker. That's a nice quality for this to have -- the only danger is that we use parallelism when we shouldn't have at all, because the serial case could manage an internal sort, and the sort was small enough that that could be a notable factor.

> "textwide" "asc" is nearly an order of magnitude faster than other
> initial orders without parallelism, but then parallelism doesn't seem
> to help it much. Also, using more than 64MB doesn't ever seem to help
> very much; in the "desc" case it hinders.

Maybe it's CPU cache efficiency? There are edge cases where multiple passes are faster than one pass. That's the only explanation I can think of.
> I was curious to understand how performance changes if we become just
> a bit less correlated (rather than completely uncorrelated or
> perfectly inversely correlated), so I tried out a 'banana skin' case:
> I took the contents of the textwide asc table and copied it to a new
> table, and then moved the 900 words matching 'banana%' to the physical
> end of the heap by deleting and reinserting them in one transaction.

A likely problem with that is that most runs will actually not have their own banana skin, so to speak. You only see a big drop in performance when every quicksort operation has presorted input, but with one or more out-of-order tuples at the end.

In order to see a really unfortunate case with parallel CREATE INDEX, you'd probably have to have enough memory that workers don't need to do their own merge (and so a worker's work almost entirely consists of one big quicksort operation), with enough "banana skin heap pages" that the parallel heap scan is pretty much guaranteed to end up giving "banana skin" (out of order) tuples to every worker, making all of them "have a slip" (throw away a huge amount of work as the presorted optimization is defeated right at the end of its sequential read through).

A better approach would be to have several small localized areas across the input where input tuples are a little out of order. That would probably show performance pretty much in line with the random cases.

> It's hard to speculate about this, but I guess that a significant
> number of indexes in real world databases might be uncorrelated to
> insert order.

That would certainly be true with text, where we see a risk of (small) regressions.

--
Peter Geoghegan
Re: [HACKERS] Parallel tuplesort (for parallel B-Tree index creation)
On Thu, Feb 16, 2017 at 6:28 AM, Robert Haas wrote: > On Thu, Feb 9, 2017 at 7:10 PM, Peter Geoghegan wrote: >> At the risk of stating the obvious, ISTM that the right way to do >> this, at a high level, is to err on the side of unneeded extra >> unlink() calls, not leaking files. And, to make the window for problem >> ("remaining hole that you haven't quite managed to plug") practically >> indistinguishable from no hole at all, in a way that's kind of baked >> into the API. > > I do not think there should be any reason why we can't get the > resource accounting exactly correct here. If a single backend manages > to remove every temporary file that it creates exactly once (and > that's currently true, modulo system crashes), a group of cooperating > backends ought to be able to manage to remove every temporary file > that any of them create exactly once (again, modulo system crashes). I believe that we are fully in agreement here. In particular, I think it's bad that there is an API that says "caller shouldn't throw an elog error between these two points", and that will be fixed before too long. I just think that it's worth acknowledging a certain nuance. > I do agree that a duplicate unlink() call isn't as bad as a missing > unlink() call, at least if there's no possibility that the filename > could have been reused by some other process, or some other part of > our own process, which doesn't want that new file unlinked. But it's > messy. If the seatbelts in your car were to randomly unbuckle, that > would be a safety hazard. If they were to randomly refuse to > unbuckle, you wouldn't say "that's OK because it's not a safety > hazard", you'd say "these seatbelts are badly designed". And I think > the same is true of this mechanism. If it happened in the lifetime of only one out of a million seatbelts manufactured, and they were manufactured at a competitive price (not over-engineered), I probably wouldn't say that. 
The fact that the existing resource manager code only LOGs most temp file related failures suggests to me that that's a "can't happen" condition, but we still hedge. I would still like to hedge against even (theoretically) impossible risks. Maybe I'm just being pedantic here, since we both actually want the code to do the same thing.

--
Peter Geoghegan
Re: [HACKERS] Parallel tuplesort (for parallel B-Tree index creation)
On Thu, Feb 9, 2017 at 7:10 PM, Peter Geoghegan wrote:
> At the risk of stating the obvious, ISTM that the right way to do
> this, at a high level, is to err on the side of unneeded extra
> unlink() calls, not leaking files. And, to make the window for problem
> ("remaining hole that you haven't quite managed to plug") practically
> indistinguishable from no hole at all, in a way that's kind of baked
> into the API.

I do not think there should be any reason why we can't get the resource accounting exactly correct here. If a single backend manages to remove every temporary file that it creates exactly once (and that's currently true, modulo system crashes), a group of cooperating backends ought to be able to manage to remove every temporary file that any of them create exactly once (again, modulo system crashes).

I do agree that a duplicate unlink() call isn't as bad as a missing unlink() call, at least if there's no possibility that the filename could have been reused by some other process, or some other part of our own process, which doesn't want that new file unlinked. But it's messy. If the seatbelts in your car were to randomly unbuckle, that would be a safety hazard. If they were to randomly refuse to unbuckle, you wouldn't say "that's OK because it's not a safety hazard", you'd say "these seatbelts are badly designed". And I think the same is true of this mechanism.

The way to make this 100% reliable is to set things up so that there is joint ownership from the beginning and shared state that lets you know whether the work has already been done.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Re: [HACKERS] Parallel tuplesort (for parallel B-Tree index creation)
On Sat, Feb 4, 2017 at 2:45 PM, Peter Geoghegan wrote: > It might just have been that the table was too small to be an > effective target for parallel sequential scan with so many workers, > and so a presorted best case CREATE INDEX, which isn't that different, > also fails to see much benefit (compared to what you'd see with a > similar case involving a larger table). In other words, I might have > jumped the gun in emphasizing issues with hardware and I/O bandwidth > over issues around data volume (that I/O parallelism is inherently not > very helpful with these relatively small tables). > > As I've pointed out a couple of times before, bigger sorts will be > more CPU bound because sorting itself has costs that grow > linearithmically, whereas writing out runs has costs that grow > linearly. The relative cost of the I/O can be expected to go down as > input goes up for this reason. At the same time, a larger input might > make better use of I/O parallelism, which reduces the cost paid in > latency to write out runs in absolute terms. Here are some results with your latest patch, using the same test as before but this time with SCALE=100 (= 100,000,000 rows). 
The table sizes are:

                       List of relations
 Schema |         Name          | Type  |    Owner     | Size  | Description
--------+-----------------------+-------+--------------+-------+-------------
 public | million_words         | table | thomas.munro | 42 MB |
 public | some_words            | table | thomas.munro | 19 MB |
 public | test_intwide_u_asc    | table | thomas.munro | 18 GB |
 public | test_intwide_u_desc   | table | thomas.munro | 18 GB |
 public | test_intwide_u_rand   | table | thomas.munro | 18 GB |
 public | test_textwide_u_asc   | table | thomas.munro | 19 GB |
 public | test_textwide_u_desc  | table | thomas.munro | 19 GB |
 public | test_textwide_u_rand  | table | thomas.munro | 19 GB |

To reduce the number of combinations I did only unique data and built only non-unique indexes with only 'wide' tuples (= key plus a text column that holds a 151-character wide string, rather than just the key), and also didn't bother with the 1MB memory size as suggested. Here are the results up to 4 workers (a results table going up to 8 workers is attached, since it wouldn't format nicely if I pasted it here). Again, the w = 0 time is seconds, the rest show relative speed-up. This data was all in the OS page cache because of a dummy run done first, and I verified with 'sar' that there was exactly 0 reading from the block device. The CPU was pegged on leader + workers during sort runs, and then the leader's CPU hovered around 93-98% during the merge/btree build. I had some technical problems getting a cold-cache read-from-actual-disk-each-time test run to work properly, but can go back and do that again if anyone thinks that would be interesting data to see.
   tab    | ord  | mem |  w = 0  | w = 1 | w = 2 | w = 3 | w = 4
----------+------+-----+---------+-------+-------+-------+-------
 intwide  | asc  |  64 |   67.91 | 1.26x | 1.46x | 1.62x | 1.73x
 intwide  | asc  | 256 |   67.84 | 1.23x | 1.48x | 1.63x | 1.79x
 intwide  | asc  | 512 |   69.01 | 1.25x | 1.50x | 1.63x | 1.80x
 intwide  | desc |  64 |   98.08 | 1.48x | 1.83x | 2.03x | 2.25x
 intwide  | desc | 256 |   99.87 | 1.43x | 1.80x | 2.03x | 2.29x
 intwide  | desc | 512 |  104.09 | 1.44x | 1.85x | 2.09x | 2.33x
 intwide  | rand |  64 |  138.03 | 1.56x | 2.04x | 2.42x | 2.58x
 intwide  | rand | 256 |  139.44 | 1.61x | 2.04x | 2.38x | 2.56x
 intwide  | rand | 512 |  138.96 | 1.52x | 2.03x | 2.28x | 2.57x
 textwide | asc  |  64 |  207.10 | 1.20x | 1.07x | 1.09x | 1.11x
 textwide | asc  | 256 |  200.62 | 1.19x | 1.06x | 1.04x | 0.99x
 textwide | asc  | 512 |  191.42 | 1.16x | 0.97x | 1.01x | 0.94x
 textwide | desc |  64 | 1382.48 | 1.89x | 2.37x | 3.18x | 3.87x
 textwide | desc | 256 | 1427.99 | 1.89x | 2.42x | 3.24x | 4.00x
 textwide | desc | 512 | 1453.21 | 1.86x | 2.39x | 3.23x | 3.75x
 textwide | rand |  64 | 1587.28 | 1.89x | 2.37x | 2.66x | 2.75x
 textwide | rand | 256 | 1557.90 | 1.85x | 2.34x | 2.64x | 2.73x
 textwide | rand | 512 | 1547.97 | 1.87x | 2.32x | 2.64x | 2.71x

"textwide" "asc" is nearly an order of magnitude faster than other initial orders without parallelism, but then parallelism doesn't seem to help it much. Also, using more than 64MB doesn't ever seem to help very much; in the "desc" case it hinders.

I was curious to understand how performance changes if we become just a bit less correlated (rather than completely uncorrelated or perfectly inversely correlated), so I tried out a 'banana skin' case: I took the contents of the textwide asc table and copied it to a new table, and then moved the 900 words matching 'banana%' to the physical end of the heap by deleting and reinserting them in one transaction. I guess if we were to use this technology for CLUSTER, this might be representative of a situation where you regularly recluster a growing
Re: [HACKERS] Parallel tuplesort (for parallel B-Tree index creation)
On Thu, Feb 9, 2017 at 6:38 PM, Thomas Munro wrote:
> Here's my thought process... please tell me where I'm going wrong:
>
> I have been assuming that it's not enough to just deal with this when
> the leader detaches on the theory that other participants will always
> detach first: that probably isn't true in some error cases, and could
> contribute to spurious racy errors where other workers complain about
> disappearing files if the leader somehow shuts down and cleans up
> while a worker is still running. Therefore we need *some* kind of
> refcounting, whether it's a new kind or a new mechanism based on the
> existing kind.

+1.

> I have also been assuming that we don't want to teach dsm.c directly
> about this stuff; it shouldn't need to know about other modules, so we
> don't want it talking to buffile.c directly and managing a special
> table of files; instead we want a system of callbacks. Therefore
> client code needs to do something after attaching to the segment in
> each backend.

+1.

> It doesn't matter whether we use an on_dsm_detach() callback and
> manage our own refcount to infer that destruction is imminent, or a
> new on_dsm_destroy() callback which tells us so explicitly: both ways
> we'll need to make sure that anyone who attaches to the segment also
> "attaches" to this shared BufFile manager object inside it, because
> any backend might turn out to be the one that is last to detach.

Not entirely. In the first case, you don't need the requirement that everyone who attaches the segment must attach to the shared BufFile manager. In the second case, you do.

> That brings us to the race you mentioned. Isn't it sufficient to say
> that you aren't allowed to do anything that might throw in between
> attaching to the segment and attaching to the SharedBufFileManager
> that it contains?

That would be sufficient, but I think it's not a very good design.
It means, for example, that nothing between the time you attach to the segment and the time you attach to this manager can palloc() anything. So, for example, it would have to happen before ParallelWorkerMain reaches the call to shm_mq_attach, which kinda sucks because we want to do that as soon as possible after attaching to the DSM segment so that errors are reported properly thereafter. Note that's the very first thing we do now, except for working out what the arguments to that call need to be. Also, while it's currently safe to assume that shm_toc_attach() and shm_toc_lookup() don't throw errors, I've thought about the possibility of installing some sort of cache in shm_toc_lookup() to amortize the cost of lookups, if the number of keys ever got too large. And that would then require a palloc(). Generally, backend code should be free to throw errors. When it's absolutely necessary for a short segment of code to avoid that, then we do, but you can't really rely on any substantial amount of code to be that way, or stay that way. And in this case, even if we didn't mind those problems or had some solution to them, I think that the shared buffer manager shouldn't have to be something that is whacked directly into parallel.c all the way at the beginning of the initialization sequence so that nothing can fail before it happens. I think it should be an optional data structure that clients of the parallel infrastructure can decide to use, or to not use. It should be at arm's length from the core code, just like the way ParallelQueryMain() is distinct from ParallelWorkerMain() and sets up its own set of data structures with their own set of keys. All that stuff is happy to happen after whatever ParallelWorkerMain() feels that it needs to do, even if ParallelWorkerMain might throw errors for any number of unknown reasons. 
Similarly, I think this new thing should be something that an executor node can decide to create inside its own per-node space -- reserved via ExecParallelEstimate, initialized via ExecParallelInitializeDSM, etc. There's no need for it to be deeply coupled to parallel.c itself unless we force that choice by sticking a no-fail requirement in there. > Up until two minutes ago I assumed that policy would leave only two > possibilities: you attach to the DSM segment and attach to the > SharedBufFileManager successfully or you attach to the DSM segment and > then die horribly (but not throw) and the postmaster restarts the > whole cluster and blows all temp files away with RemovePgTempFiles(). > But I see now in the comment of that function that crash-induced > restarts don't call that because "someone might want to examine the > temp files for debugging purposes". Given that policy for regular > private BufFiles, I don't see why that shouldn't apply equally to > shared files: after a crash restart, you may have some junk files that > won't be cleaned up until your next clean restart, whether they were > private or shared BufFiles. I think most people (other than Tom) would agree that that policy isn't really sensible.
Re: [HACKERS] Parallel tuplesort (for parallel B-Tree index creation)
On Thu, Feb 9, 2017 at 2:31 PM, Robert Haas wrote: > You might think about plugging that hole by moving the registry of > on-destroy functions into the segment itself and making it a shared > resource. But ASLR breaks that, especially for loadable modules. You > could try to fix that problem, in turn, by storing arguments that can > later be passed to load_external_function() instead of a function > pointer per se. But that sounds pretty fragile because some other > backend might not try to load the module until after it's attached the > DSM segment and it might then fail because loading the module runs > _PG_init() which can throw errors. Maybe you can think of a way to > plug that hole too but you're way over your $8 budget by this > point. At the risk of stating the obvious, ISTM that the right way to do this, at a high level, is to err on the side of unneeded extra unlink() calls, not leaking files. And, to make the window for problems ("remaining hole that you haven't quite managed to plug") practically indistinguishable from no hole at all, in a way that's kind of baked into the API. It's not like we currently throw an error when there is a problem with deleting temp files that are no longer needed on resource manager cleanup. We simply log the fact that it happened, and limp on. I attach my V8. This does not yet do anything with on_dsm_detach(). I've run out of time to work on it this week, and am starting a new job next week at VMware, which I'll need time to settle into. So I'm posting this now, since you can still very much see the direction I'm going in, and can give me any feedback that you have. If anyone wants to show me how it's done by building on this, and finishing what I have off, be my guest. The new stuff probably isn't quite as polished as I would prefer, but time grows short, so I won't withhold it.
Changes: * Implements refcount thing, albeit in a way that leaves a small window for double unlink() calls if there is an error during the small window in which there is worker/leader co-ownership of a BufFile (just add an "elog(ERROR)" just before leader-as-worker Tuplesort state is ended within _bt_leafbuild() to see what I mean). This implies that background workers can be reclaimed once the leader needs to start its final on-the-fly merge, which is nice. As an example of how that's nice, this change makes maintenance_work_mem a budget that we more strictly adhere to. * Fixes bitrot caused by recent logtape.c bugfix in master branch. * No local segment is created during unification unless and until one is required. (In practice, for current use of BufFile infrastructure, no "local" segment is ever created, even if we force a randomAccess case using one of the testing GUCs from 0002-* -- we'd have to use another GUC to *also* force there to be no reclamation.) * Better testing. As I just mentioned, we can now force logtape.c to not reclaim blocks, so you make new local segments as part of a unified BufFile, which have different considerations from a resource management point of view. Despite being part of the same "unified" BufFile from the leader's perspective, it behaves like a local segment, so it definitely seems like a good idea to have test coverage for this, at least during development. (I have a pretty rough test suite that I'm using; development of this patch has been somewhat test driven.) * Better encapsulation of BufFile stuff. I am even closer to the ideal of this whole sharing mechanism being a fairly generic BufFile thing that logtape.c piggy-backs on without having special knowledge of the mechanism. It's still true that the mechanism (sharing/unification) is written principally with logtape.c in mind, but that's just because of its performance characteristics. Nothing to do with the interface. 
* Worked through items raised by Thomas in his 2017-01-30 mail to this thread. Secondly, I might not want to be constrained by a fixed-sized DSM segment to hold my SharedBufFile objects... there are cases where I need to share a number of batch files that is unknown at the start of execution time when the DSM segment is sized (I'll write about that shortly on the Parallel Shared Hash thread). Maybe I can find a way to get rid of that requirement. Or maybe it could support DSA memory too, but I don't think it's possible to use on_dsm_detach-based cleanup routines that refer to DSA memory because by the time any given DSM segment's detach hook runs, there's no telling which other DSM segments have been detached already, so the DSA area may already have partially vanished; some other kind of hook that runs earlier would be needed... >>> >>> Again, wrench. I like the wrench analogy too, FWIW. >> My problem here is that I don't know how many batches I'll finish up >> creating. In general that's OK because I can hold onto them as >> private BufFiles owned by participants with the existing cleanup >> mechanism
Re: [HACKERS] Parallel tuplesort (for parallel B-Tree index creation)
On Fri, Feb 10, 2017 at 11:31 AM, Robert Haas wrote: > On Thu, Feb 9, 2017 at 5:09 PM, Thomas Munro > wrote: >> I agree that the pinned segment case doesn't matter right now, I just >> wanted to point it out. I like your $10 wrench analogy, but maybe it >> could be argued that adding a dsm_on_destroy() callback mechanism is >> not only better than adding another refcount to track that other >> refcount, but also a steal at only $8. > > If it's that simple, it might be worth doing, but I bet it's not. One > problem is that there's a race condition: there will inevitably be a > period of time after you've called dsm_attach() and before you've > attached to the specific data structure that we're talking about here. > So suppose the last guy who actually knows about this data structure > dies horribly and doesn't clean up because the DSM isn't being > destroyed; moments later, you die horribly before reaching the code > where you attach to this data structure. Oops. Right, I mentioned this problem earlier ("and also register the on_dsm_detach callback, before any chance that an error might be thrown (is that difficult?); failure to do so could result in file leaks"). Here's my thought process... please tell me where I'm going wrong: I have been assuming that it's not enough to just deal with this when the leader detaches on the theory that other participants will always detach first: that probably isn't true in some error cases, and could contribute to spurious racy errors where other workers complain about disappearing files if the leader somehow shuts down and cleans up while a worker is still running. Therefore we need *some* kind of refcounting, whether it's a new kind or a new mechanism based on the existing kind. 
I have also been assuming that we don't want to teach dsm.c directly about this stuff; it shouldn't need to know about other modules, so we don't want it talking to buffile.c directly and managing a special table of files; instead we want a system of callbacks. Therefore client code needs to do something after attaching to the segment in each backend. It doesn't matter whether we use an on_dsm_detach() callback and manage our own refcount to infer that destruction is imminent, or a new on_dsm_destroy() callback which tells us so explicitly: both ways we'll need to make sure that anyone who attaches to the segment also "attaches" to this shared BufFile manager object inside it, because any backend might turn out to be the one that is last to detach. That brings us to the race you mentioned. Isn't it sufficient to say that you aren't allowed to do anything that might throw in between attaching to the segment and attaching to the SharedBufFileManager that it contains? Up until two minutes ago I assumed that policy would leave only two possibilities: you attach to the DSM segment and attach to the SharedBufFileManager successfully or you attach to the DSM segment and then die horribly (but not throw) and the postmaster restarts the whole cluster and blows all temp files away with RemovePgTempFiles(). But I see now in the comment of that function that crash-induced restarts don't call that because "someone might want to examine the temp files for debugging purposes". Given that policy for regular private BufFiles, I don't see why that shouldn't apply equally to shared files: after a crash restart, you may have some junk files that won't be cleaned up until your next clean restart, whether they were private or shared BufFiles. > You might think about plugging that hole by moving the registry of > on-destroy functions into the segment itself and making it a shared > resource. But ASLR breaks that, especially for loadable modules. 
You > could try to fix that problem, in turn, by storing arguments that can > later be passed to load_external_function() instead of a function > pointer per se. But that sounds pretty fragile because some other > backend might not try to load the module until after it's attached the > DSM segment and it might then fail because loading the module runs > _PG_init() which can throw errors. Maybe you can think of a way to > plug that hole too but you're way over your $8 budget by this > point. Agreed, those approaches seem like non-starters. >> My problem here is that I don't know how many batches I'll finish up >> creating. [...] > > I thought the idea was that the structure we're talking about here > owns all the files, up to 2 from a leader that wandered off plus up to > 2 for each worker. Last process standing removes them. Or are you > saying each worker only needs 2 files but the leader needs a > potentially unbounded number? Yes, potentially unbounded in rare cases. If we plan for N batches, and then run out of work_mem because our estimates were just wrong or the distribution of keys is sufficiently skewed, we'll run HashIncreaseNumBatches, and that could happen more than once. I have a suite of contrived test queries that hits all the v
Re: [HACKERS] Parallel tuplesort (for parallel B-Tree index creation)
On Thu, Feb 9, 2017 at 5:09 PM, Thomas Munro wrote: > I agree that the pinned segment case doesn't matter right now, I just > wanted to point it out. I like your $10 wrench analogy, but maybe it > could be argued that adding a dsm_on_destroy() callback mechanism is > not only better than adding another refcount to track that other > refcount, but also a steal at only $8. If it's that simple, it might be worth doing, but I bet it's not. One problem is that there's a race condition: there will inevitably be a period of time after you've called dsm_attach() and before you've attached to the specific data structure that we're talking about here. So suppose the last guy who actually knows about this data structure dies horribly and doesn't clean up because the DSM isn't being destroyed; moments later, you die horribly before reaching the code where you attach to this data structure. Oops. You might think about plugging that hole by moving the registry of on-destroy functions into the segment itself and making it a shared resource. But ASLR breaks that, especially for loadable modules. You could try to fix that problem, in turn, by storing arguments that can later be passed to load_external_function() instead of a function pointer per se. But that sounds pretty fragile because some other backend might not try to load the module until after it's attached the DSM segment and it might then fail because loading the module runs _PG_init() which can throw errors. Maybe you can think of a way to plug that hole too but you're way over your $8 budget by this point. >>> Secondly, I might not want to be constrained by a >>> fixed-sized DSM segment to hold my SharedBufFile objects... there are >>> cases where I need to shared a number of batch files that is unknown >>> at the start of execution time when the DSM segment is sized (I'll >>> write about that shortly on the Parallel Shared Hash thread). Maybe I >>> can find a way to get rid of that requirement. 
Or maybe it could >>> support DSA memory too, but I don't think it's possible to use >>> on_dsm_detach-based cleanup routines that refer to DSA memory because >>> by the time any given DSM segment's detach hook runs, there's no >>> telling which other DSM segments have been detached already, so the >>> DSA area may already have partially vanished; some other kind of hook >>> that runs earlier would be needed... >> >> Again, wrench. > > My problem here is that I don't know how many batches I'll finish up > creating. In general that's OK because I can hold onto them as > private BufFiles owned by participants with the existing cleanup > mechanism, and then share them just before they need to be shared (ie > when we switch to processing the next batch so they need to be > readable by all). Now I only ever share one inner and one outer batch > file per participant at a time, and then I explicitly delete them at a > time that I know to be safe and before I need to share a new file that > would involve recycling the slot, and I'm relying on DSM segment scope > cleanup only to handle error paths. That means that in generally I > only need space for 2 * P shared BufFiles at a time. But there is a > problem case: when the leader needs to exit early, it needs to be able > to transfer ownership of any files it has created, which could be more > than we planned for, and then not participate any further in the hash > join, so it can't participate in the on-demand sharing scheme. I thought the idea was that the structure we're talking about here owns all the files, up to 2 from a leader that wandered off plus up to 2 for each worker. Last process standing removes them. Or are you saying each worker only needs 2 files but the leader needs a potentially unbounded number? > Perhaps we can find a way to describe a variable number of BufFiles > (ie batches) in a fixed space by making sure the filenames are > constructed in a way that lets us just have to say how many there are. 
That could be done. > Then the next problem is that for each BufFile we have to know how > many 1GB segments there are to unlink (files named foo, foo.1, foo.2, > ...), which Peter's code currently captures by publishing the file > size in the descriptor... but if a fixed size object must describe N > BufFiles, where can I put the size of each one? Maybe I could put it > in a header of the file itself (yuck!), or maybe I could decide that I > don't care what the size is, I'll simply unlink "foo", then "foo.1", > then "foo.2", ... until I get ENOENT. There's nothing wrong with that algorithm as far as I'm concerned. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
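The "unlink until ENOENT" idea discussed just above can be sketched in plain C. This is an illustration only, not code from the patch; the function name and return convention are invented. The point is that a fixed-size descriptor need not record each BufFile's size: the deleting backend just probes for numbered 1GB segment files until the operating system says there are no more.

```c
#include <errno.h>
#include <stdio.h>
#include <unistd.h>

/*
 * Delete a BufFile's underlying files ("name", "name.1", "name.2", ...)
 * without knowing in advance how many segments exist.  Stop at the first
 * ENOENT, which marks the end of the segment chain.  Returns the number
 * of files removed, or -1 on an unexpected error (e.g. EACCES).
 */
int
delete_segment_files(const char *name)
{
    char path[4096];
    long segno;
    int  removed = 0;

    if (unlink(name) == 0)
        removed++;
    else if (errno != ENOENT)
        return -1;

    for (segno = 1;; segno++)
    {
        snprintf(path, sizeof(path), "%s.%ld", name, segno);
        if (unlink(path) != 0)
        {
            if (errno == ENOENT)
                break;          /* past the last segment: done */
            return -1;          /* a real error: let the caller log it */
        }
        removed++;
    }
    return removed;
}
```

As Robert says, there's nothing wrong with this algorithm; it trades one extra failing unlink() per BufFile for not having to publish sizes in shared memory.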
Re: [HACKERS] Parallel tuplesort (for parallel B-Tree index creation)
On Fri, Feb 10, 2017 at 9:51 AM, Robert Haas wrote: > On Wed, Feb 8, 2017 at 5:36 AM, Thomas Munro > wrote: >> Thinking a bit harder about this, I suppose there could be a kind of >> object called a SharedBufFileManager [... description of that ...]. > > I think this is approximately reasonable, but I think it could be made > simpler by having fewer separate objects. Let's assume the leader can > put an upper bound on the number of shared BufFiles at the time it's > sizing the DSM segment (i.e. before InitializeParallelDSM). Then it > can allocate a big ol' array with a header indicating the array size > and each element containing enough space to identify the relevant > details of 1 shared BufFile. Now you don't need to do any allocations > later on, and you don't need a linked list. You just loop over the > array and do what needs doing. Makes sense. >> There are a couple of problems with the above though. Firstly, doing >> reference counting in DSM segment on-detach hooks is really a way to >> figure out when the DSM segment is about to be destroyed by keeping a >> separate refcount in sync with the DSM segment's refcount, but it >> doesn't account for pinned DSM segments. It's not your use-case or >> mine currently, but someone might want a DSM segment to live even when >> it's not attached anywhere, to be reattached later. If we're trying >> to use DSM segment lifetime as a scope, we'd be ignoring this detail. >> Perhaps instead of adding our own refcount we need a new kind of hook >> on_dsm_destroy. > > I think it's good enough to plan for current needs now. It's not > impossible to change this stuff later, but we need something that > works robustly right now without being too invasive. Inventing whole > new system concepts because of stuff we might someday want to do isn't > a good idea because we may easily guess wrong about what direction > we'll want to go in the future. 
This is more like building a wrench > than a 747: a 747 needs to be extensible and reconfigurable and > upgradable because it costs $350 million. A wrench costs $10 at > Walmart and if it turns out we bought the wrong one, we can just throw > it out and get a different one later. I agree that the pinned segment case doesn't matter right now, I just wanted to point it out. I like your $10 wrench analogy, but maybe it could be argued that adding a dsm_on_destroy() callback mechanism is not only better than adding another refcount to track that other refcount, but also a steal at only $8. >> Secondly, I might not want to be constrained by a >> fixed-sized DSM segment to hold my SharedBufFile objects... there are >> cases where I need to shared a number of batch files that is unknown >> at the start of execution time when the DSM segment is sized (I'll >> write about that shortly on the Parallel Shared Hash thread). Maybe I >> can find a way to get rid of that requirement. Or maybe it could >> support DSA memory too, but I don't think it's possible to use >> on_dsm_detach-based cleanup routines that refer to DSA memory because >> by the time any given DSM segment's detach hook runs, there's no >> telling which other DSM segments have been detached already, so the >> DSA area may already have partially vanished; some other kind of hook >> that runs earlier would be needed... > > Again, wrench. My problem here is that I don't know how many batches I'll finish up creating. In general that's OK because I can hold onto them as private BufFiles owned by participants with the existing cleanup mechanism, and then share them just before they need to be shared (ie when we switch to processing the next batch so they need to be readable by all). 
Now I only ever share one inner and one outer batch file per participant at a time, and then I explicitly delete them at a time that I know to be safe and before I need to share a new file that would involve recycling the slot, and I'm relying on DSM segment scope cleanup only to handle error paths. That means that in general I only need space for 2 * P shared BufFiles at a time. But there is a problem case: when the leader needs to exit early, it needs to be able to transfer ownership of any files it has created, which could be more than we planned for, and then not participate any further in the hash join, so it can't participate in the on-demand sharing scheme. Perhaps we can find a way to describe a variable number of BufFiles (ie batches) in a fixed space by making sure the filenames are constructed in a way that lets us just have to say how many there are. Then the next problem is that for each BufFile we have to know how many 1GB segments there are to unlink (files named foo, foo.1, foo.2, ...), which Peter's code currently captures by publishing the file size in the descriptor... but if a fixed size object must describe N BufFiles, where can I put the size of each one? Maybe I could put it in a header of the file itself (yuck!), or maybe I could decide that I don't care what the size is, I'll simply unlink "foo", then "foo.1", then "foo.2", ... until I get ENOENT.
Re: [HACKERS] Parallel tuplesort (for parallel B-Tree index creation)
On Wed, Feb 8, 2017 at 5:36 AM, Thomas Munro wrote: > Thinking a bit harder about this, I suppose there could be a kind of > object called a SharedBufFileManager (insert better name) which you > can store in a DSM segment. The leader backend that initialises a DSM > segment containing one of these would then call a constructor function > that sets an internal refcount to 1 and registers an on_dsm_detach > callback for its on-detach function. All worker backends that attach > to the DSM segment would need to call an attach function for the > SharedBufFileManager to increment a refcount and also register the > on_dsm_detach callback, before any chance that an error might be > thrown (is that difficult?); failure to do so could result in file > leaks. Then, when a BufFile is to be shared (AKA exported, made > unifiable), a SharedBufFile object can be initialised somewhere in the > same DSM segment and registered with the SharedBufFileManager. > Internally all registered SharedBufFile objects would be linked > together using offsets from the start of the DSM segment for link > pointers. Now when SharedBufFileManager's on-detach function runs, it > decrements the refcount in the SharedBufFileManager, and if that > reaches zero then it runs a destructor that spins through the list of > SharedBufFile objects deleting files that haven't already been deleted > explicitly. I think this is approximately reasonable, but I think it could be made simpler by having fewer separate objects. Let's assume the leader can put an upper bound on the number of shared BufFiles at the time it's sizing the DSM segment (i.e. before InitializeParallelDSM). Then it can allocate a big ol' array with a header indicating the array size and each element containing enough space to identify the relevant details of 1 shared BufFile. Now you don't need to do any allocations later on, and you don't need a linked list. You just loop over the array and do what needs doing. 
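The flat-array layout Robert suggests might look roughly like this. All names here are invented for illustration (the real structure would live in a DSM segment sized before InitializeParallelDSM, and slot scans would be protected by a spinlock or LWLock, omitted here):

```c
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>

/* One fixed-size slot describing a shared BufFile. */
typedef struct SharedFileSlot
{
    char    name[64];           /* temp file name */
    int64_t size;               /* total bytes, to derive 1GB segment count */
    int     in_use;
} SharedFileSlot;

/* Header plus array, allocated in one contiguous chunk up front:
 * no later allocation, no linked list, just loop over the array. */
typedef struct SharedFileTable
{
    uint32_t       nslots;      /* fixed when the segment is sized */
    SharedFileSlot slots[];     /* flexible array member */
} SharedFileTable;

/* Space the leader must reserve for a table of n slots. */
size_t
shared_file_table_bytes(uint32_t n)
{
    return offsetof(SharedFileTable, slots) + n * sizeof(SharedFileSlot);
}

void
shared_file_table_init(SharedFileTable *t, uint32_t n)
{
    t->nslots = n;
    memset(t->slots, 0, n * sizeof(SharedFileSlot));
}

/* Publish a file in the first free slot; returns the slot index, or -1
 * if the planned upper bound on shared BufFiles was exceeded. */
int
shared_file_table_insert(SharedFileTable *t, const char *name, int64_t size)
{
    for (uint32_t i = 0; i < t->nslots; i++)
    {
        if (!t->slots[i].in_use)
        {
            snprintf(t->slots[i].name, sizeof(t->slots[i].name), "%s", name);
            t->slots[i].size = size;
            t->slots[i].in_use = 1;
            return (int) i;
        }
    }
    return -1;
}
```

The -1 path is exactly the weakness Thomas goes on to describe below: it only works if the number of shared files really can be bounded when the segment is sized.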
> There are a couple of problems with the above though. Firstly, doing > reference counting in DSM segment on-detach hooks is really a way to > figure out when the DSM segment is about to be destroyed by keeping a > separate refcount in sync with the DSM segment's refcount, but it > doesn't account for pinned DSM segments. It's not your use-case or > mine currently, but someone might want a DSM segment to live even when > it's not attached anywhere, to be reattached later. If we're trying > to use DSM segment lifetime as a scope, we'd be ignoring this detail. > Perhaps instead of adding our own refcount we need a new kind of hook > on_dsm_destroy. I think it's good enough to plan for current needs now. It's not impossible to change this stuff later, but we need something that works robustly right now without being too invasive. Inventing whole new system concepts because of stuff we might someday want to do isn't a good idea because we may easily guess wrong about what direction we'll want to go in the future. This is more like building a wrench than a 747: a 747 needs to be extensible and reconfigurable and upgradable because it costs $350 million. A wrench costs $10 at Walmart and if it turns out we bought the wrong one, we can just throw it out and get a different one later. > Secondly, I might not want to be constrained by a > fixed-sized DSM segment to hold my SharedBufFile objects... there are > cases where I need to shared a number of batch files that is unknown > at the start of execution time when the DSM segment is sized (I'll > write about that shortly on the Parallel Shared Hash thread). Maybe I > can find a way to get rid of that requirement. 
Or maybe it could > support DSA memory too, but I don't think it's possible to use > on_dsm_detach-based cleanup routines that refer to DSA memory because > by the time any given DSM segment's detach hook runs, there's no > telling which other DSM segments have been detached already, so the > DSA area may already have partially vanished; some other kind of hook > that runs earlier would be needed... Again, wrench. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Re: [HACKERS] Parallel tuplesort (for parallel B-Tree index creation)
On Wed, Feb 8, 2017 at 8:40 PM, Thomas Munro wrote: > On Tue, Feb 7, 2017 at 5:43 PM, Peter Geoghegan wrote: >> Does anyone have any suggestions on how to tackle this? > > Hmm. One approach might be like this: > > [hand-wavy stuff] Thinking a bit harder about this, I suppose there could be a kind of object called a SharedBufFileManager (insert better name) which you can store in a DSM segment. The leader backend that initialises a DSM segment containing one of these would then call a constructor function that sets an internal refcount to 1 and registers an on_dsm_detach callback for its on-detach function. All worker backends that attach to the DSM segment would need to call an attach function for the SharedBufFileManager to increment a refcount and also register the on_dsm_detach callback, before any chance that an error might be thrown (is that difficult?); failure to do so could result in file leaks. Then, when a BufFile is to be shared (AKA exported, made unifiable), a SharedBufFile object can be initialised somewhere in the same DSM segment and registered with the SharedBufFileManager. Internally all registered SharedBufFile objects would be linked together using offsets from the start of the DSM segment for link pointers. Now when SharedBufFileManager's on-detach function runs, it decrements the refcount in the SharedBufFileManager, and if that reaches zero then it runs a destructor that spins through the list of SharedBufFile objects deleting files that haven't already been deleted explicitly. I retract the pin/unpin and per-file refcounting stuff I mentioned earlier. You could make the default that all files registered with a SharedBufFileManager survive until the containing DSM segment is detached everywhere using that single refcount in the SharedBufFileManager object, but also provide a 'no really delete this particular shared file now' operation for client code that knows it's safe to do that sooner (which would be the case for me, I think). 
I don't think per-file refcounts are needed. There are a couple of problems with the above though. Firstly, doing reference counting in DSM segment on-detach hooks is really a way to figure out when the DSM segment is about to be destroyed by keeping a separate refcount in sync with the DSM segment's refcount, but it doesn't account for pinned DSM segments. It's not your use-case or mine currently, but someone might want a DSM segment to live even when it's not attached anywhere, to be reattached later. If we're trying to use DSM segment lifetime as a scope, we'd be ignoring this detail. Perhaps instead of adding our own refcount we need a new kind of hook on_dsm_destroy. Secondly, I might not want to be constrained by a fixed-sized DSM segment to hold my SharedBufFile objects... there are cases where I need to share a number of batch files that is unknown at the start of execution time when the DSM segment is sized (I'll write about that shortly on the Parallel Shared Hash thread). Maybe I can find a way to get rid of that requirement. Or maybe it could support DSA memory too, but I don't think it's possible to use on_dsm_detach-based cleanup routines that refer to DSA memory because by the time any given DSM segment's detach hook runs, there's no telling which other DSM segments have been detached already, so the DSA area may already have partially vanished; some other kind of hook that runs earlier would be needed... Hmm. -- Thomas Munro http://www.enterprisedb.com
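In miniature, the refcount scheme sketched in the message above behaves like this. It's a simulation with ordinary memory and direct calls, not the real mechanism: in the actual design the struct would live in the DSM segment, the counter would be protected by a spinlock, and the detach function would run from an on_dsm_detach callback. All names are invented.

```c
/* Simulated SharedBufFileManager-style refcount.  files_deleted stands in
 * for "unlink every registered file that wasn't already deleted". */
typedef struct FileSetRefcount
{
    int refcount;       /* one per attached backend */
    int nfiles;         /* files registered for cleanup */
    int files_deleted;  /* set by whichever backend detaches last */
} FileSetRefcount;

/* The leader initialises the manager with refcount 1 when it sets up
 * the DSM segment. */
void
fileset_init(FileSetRefcount *m)
{
    m->refcount = 1;
    m->nfiles = 0;
    m->files_deleted = 0;
}

/* Every worker must call this immediately after attaching the segment,
 * before anything that might throw; otherwise files could leak. */
void
fileset_attach(FileSetRefcount *m)
{
    m->refcount++;
}

/* Runs from each backend's on-detach path: whoever decrements the
 * count to zero is "last to turn out the lights" and deletes the files. */
void
fileset_detach(FileSetRefcount *m)
{
    if (--m->refcount == 0)
        m->files_deleted = m->nfiles;   /* would unlink each file here */
}
```

This also makes the two open questions concrete: the scheme infers "segment about to be destroyed" from its own counter reaching zero, which is wrong for pinned segments, and the set of registered files must fit whatever fixed space was reserved.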
Re: [HACKERS] Parallel tuplesort (for parallel B-Tree index creation)
On Tue, Feb 7, 2017 at 5:43 PM, Peter Geoghegan wrote: > However, there are some specific implementation issues with this that > I didn't quite anticipate. I would like to get feedback on these > issues now, from both Thomas and Robert. The issues relate to how much > the patch can or should "buy into resource management". You might > guess that this new resource management code is something that should > live in fd.c, alongside the guts of temp file resource management, > within the function FileClose(). That way, it would be called by every > possible path that might delete a temp file, including > ResourceOwnerReleaseInternal(). That's not what I've done, though. > Instead, refcount management is limited to a few higher level routines > in buffile.c. Initially, resource management in FileClose() is made to > assume that it must delete the file. Then, if and when directed to by > BufFileClose()/refcount, a backend may determine that it is not its > job to do the deletion -- it will not be the one that must "turn out > the lights", and so indicates to FileClose() that it should not delete > the file after all (it should just release vFDs, close(), and so on). > Otherwise, when refcount reaches zero, temp files are deleted by > FileClose() in more or less the conventional manner. > > The fact that there could, in general, be any error that causes us to > attempt a double-deletion (deletion of a temp file from more than one > backend) for a time is less of a problem than you might think. This is > because there is a risk of this only for as long as two backends hold > open the file at the same time. In the case of parallel CREATE INDEX, > this is now the shortest possible period of time, since workers close > their files using BufFileClose() immediately after the leader wakes > them up from a quiescent state. 
And, if that were to actually happen, > say due to some random OOM error during that small window, the > consequence is no worse than an annoying log message: "could not > unlink file..." (this would come from the second backend that > attempted an unlink()). You would not see this when a worker raised an > error due to a duplicate violation, or any other routine problem, so > it should really be almost impossible. > > That having been said, this probably *is* a problematic restriction in > cases where a temp file's ownership is not immediately handed over > without concurrent sharing. What happens to be a small window for the > parallel CREATE INDEX patch probably wouldn't be a small window for > parallel hash join. :-( > > It's not hard to see why I would like to do things this way. Just look > at ResourceOwnerReleaseInternal(). Any release of a file happens > during RESOURCE_RELEASE_AFTER_LOCKS, whereas the release of dynamic > shared memory segments happens earlier, during > RESOURCE_RELEASE_BEFORE_LOCKS. ISTM that the only sensible way to > implement a refcount is using dynamic shared memory, and that seems > hard. There are additional reasons why I suggest we go this way, such > as the fact that all the relevant state belongs to BufFile, which is > implemented a layer above all of the guts of resource management of > temp files within fd.c. I'd have to replicate almost all state in fd.c > to make it all work, which seems like a big modularity violation. > > Does anyone have any suggestions on how to tackle this? Hmm. One approach might be like this: 1. There is a shared refcount which is incremented when you open a shared file and decremented if you optionally explicitly 'release' it. 
(Not when you close it, because we can't allow code that may be run during RESOURCE_RELEASE_AFTER_LOCKS to try to access the DSM segment after it has been unmapped; more generally, creating destruction order dependencies between different kinds of resource-manager-cleaned-up objects seems like a bad idea. Of course the close code still looks after closing the vfds in the local backend.)

2. If you want to hand the file over to some other process and exit, you probably want to avoid race conditions or extra IPC burden. To achieve that you could 'pin' the file, so that it survives even while not open in any backend.

3. If the refcount reaches zero when you 'release' and the file isn't 'pinned', then you must delete the underlying files.

4. When the DSM segment is detached, we spin through all associated shared files that we're still 'attached' to (ie opened but didn't release) and decrement the refcount. If any shared file's refcount reaches zero, its files should be deleted, even if it was 'pinned'.

In other words, the associated DSM segment's lifetime is the maximum lifetime of shared files, but it can be shorter if you 'release' in all backends and don't 'pin'. It's up to client code to come up with some scheme to make that work, if it doesn't take the easy route of pinning until DSM segment destruction. I think in your case you'd simply pin all the BufFiles allowing workers to exit when they're done;
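The four-step pin/release scheme above can be sketched as follows. This is a single-threaded stand-in for illustration only: the names (SharedFileRef and friends) are hypothetical, and real code would protect the struct with a slock_t in the DSM segment and unlink actual temp files.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical per-file control block living in the DSM segment. */
typedef struct SharedFileRef
{
    int  refcount;   /* backends that opened but didn't 'release' */
    bool pinned;     /* file survives even with refcount == 0 */
    bool deleted;    /* underlying temp files have been unlinked */
} SharedFileRef;

static void
sfr_open(SharedFileRef *ref)
{
    ref->refcount++;
}

/* Step 2: 'pin' so the file survives while open in no backend. */
static void
sfr_pin(SharedFileRef *ref)
{
    ref->pinned = true;
}

/* Step 3: explicit 'release' deletes at refcount zero, unless pinned. */
static void
sfr_release(SharedFileRef *ref)
{
    if (--ref->refcount == 0 && !ref->pinned && !ref->deleted)
        ref->deleted = true;    /* stand-in for unlinking the files */
}

/* Step 4: called from the DSM segment's on-detach path for each file
 * still attached; here deletion happens even if the file was pinned,
 * making segment lifetime the maximum file lifetime. */
static void
sfr_on_segment_detach(SharedFileRef *ref)
{
    if (--ref->refcount == 0 && !ref->deleted)
        ref->deleted = true;
}
```

Under this sketch, a worker that pins a file and releases it can exit immediately; the file then lives until the leader releases its own reference or the segment goes away.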
Re: [HACKERS] Parallel tuplesort (for parallel B-Tree index creation)
On Mon, Jan 30, 2017 at 9:15 PM, Peter Geoghegan wrote:
>> IIUC worker_wait() is only being used to keep the worker around so its
>> files aren't deleted. Once buffile cleanup is changed to be
>> ref-counted (in an on_dsm_detach hook?) then workers might as well
>> exit sooner, freeing up a worker slot... do I have that right?
>
> Yes. Or at least I think it's very likely that that will end up happening.

I've looked into this, and have a version of the patch where clean-up occurs when the last backend with a reference to the BufFile goes away. It seems robust; all of my private tests pass, including things that parallel CREATE INDEX won't use but that are added as infrastructure (e.g., randomAccess recycling of blocks by leader from workers). As Thomas anticipated, worker_wait() now only makes workers wait until the leader comes along to take a reference to their files, at which point the worker processes can go away. In effect, the worker processes go away as soon as possible, just as the leader begins its final on-the-fly merge. At that point, they could be reused by some other process, of course.

However, there are some specific implementation issues with this that I didn't quite anticipate. I would like to get feedback on these issues now, from both Thomas and Robert. The issues relate to how much the patch can or should "buy into resource management". You might guess that this new resource management code is something that should live in fd.c, alongside the guts of temp file resource management, within the function FileClose(). That way, it would be called by every possible path that might delete a temp file, including ResourceOwnerReleaseInternal(). That's not what I've done, though. Instead, refcount management is limited to a few higher level routines in buffile.c. Initially, resource management in FileClose() is made to assume that it must delete the file.
Then, if and when directed to by BufFileClose()/refcount, a backend may determine that it is not its job to do the deletion -- it will not be the one that must "turn out the lights", and so indicates to FileClose() that it should not delete the file after all (it should just release vFDs, close(), and so on). Otherwise, when refcount reaches zero, temp files are deleted by FileClose() in more or less the conventional manner. The fact that there could, in general, be any error that causes us to attempt a double-deletion (deletion of a temp file from more than one backend) for a time is less of a problem than you might think. This is because there is a risk of this only for as long as two backends hold open the file at the same time. In the case of parallel CREATE INDEX, this is now the shortest possible period of time, since workers close their files using BufFileClose() immediately after the leader wakes them up from a quiescent state. And, if that were to actually happen, say due to some random OOM error during that small window, the consequence is no worse than an annoying log message: "could not unlink file..." (this would come from the second backend that attempted an unlink()). You would not see this when a worker raised an error due to a duplicate violation, or any other routine problem, so it should really be almost impossible. That having been said, this probably *is* a problematic restriction in cases where a temp file's ownership is not immediately handed over without concurrent sharing. What happens to be a small window for the parallel CREATE INDEX patch probably wouldn't be a small window for parallel hash join. :-( It's not hard to see why I would like to do things this way. Just look at ResourceOwnerReleaseInternal(). Any release of a file happens during RESOURCE_RELEASE_AFTER_LOCKS, whereas the release of dynamic shared memory segments happens earlier, during RESOURCE_RELEASE_BEFORE_LOCKS. 
ISTM that the only sensible way to implement a refcount is using dynamic shared memory, and that seems hard. There are additional reasons why I suggest we go this way, such as the fact that all the relevant state belongs to BufFile, which is implemented a layer above all of the guts of resource management of temp files within fd.c. I'd have to replicate almost all state in fd.c to make it all work, which seems like a big modularity violation. Does anyone have any suggestions on how to tackle this?

-- 
Peter Geoghegan
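As a toy illustration of the buffile.c-level hand-off described above (hypothetical names, not the patch's actual code): every backend closes the file, but only the backend that drops the shared refcount to zero "turns out the lights" and unlinks; the others just release their vfds.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical shared state for one temp file. */
typedef struct SharedTempFile
{
    int  refcount;   /* backends currently holding the file open */
    bool unlinked;   /* set by whichever backend deletes it */
} SharedTempFile;

/* Returns true if this caller was the one that deleted the file.
 * In either branch the caller would still release its local vfds
 * and close() its descriptors. */
static bool
shared_buffile_close(SharedTempFile *f)
{
    if (--f->refcount > 0)
        return false;           /* deletion is not our job */
    f->unlinked = true;         /* stand-in for unlink() */
    return true;
}
```

The window for a double-unlink attempt exists only while two backends hold the file open at once, which matches the "annoying log message at worst" analysis above.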
Re: [HACKERS] Parallel tuplesort (for parallel B-Tree index creation)
On Fri, Feb 3, 2017 at 4:15 PM, Thomas Munro wrote:
>> I suspect that this system isn't particularly well balanced for the
>> task of benchmarking the patch. You would probably see notably better
>> scalability than any you've shown in any test if you could add
>> additional sequential I/O bandwidth, which is probably an economical,
>> practical choice for many users. I suspect that you aren't actually
>> saturating available CPUs to the greatest extent that the
>> implementation makes possible.
>
> I will look into what IO options I can access before running larger
> tests. Also I will look into running the test with both cold and warm
> caches (ie "echo 1 > /proc/sys/vm/drop_caches") so that read bandwidth
> enters the picture.

It might just have been that the table was too small to be an effective target for parallel sequential scan with so many workers, and so a presorted best case CREATE INDEX, which isn't that different, also fails to see much benefit (compared to what you'd see with a similar case involving a larger table). In other words, I might have jumped the gun in emphasizing issues with hardware and I/O bandwidth over issues around data volume (that I/O parallelism is inherently not very helpful with these relatively small tables).

As I've pointed out a couple of times before, bigger sorts will be more CPU bound because sorting itself has costs that grow linearithmically, whereas writing out runs has costs that grow linearly. The relative cost of the I/O can be expected to go down as input goes up for this reason. At the same time, a larger input might make better use of I/O parallelism, which reduces the cost paid in latency to write out runs in absolute terms.

-- 
Peter Geoghegan
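The cost argument in the last paragraph can be made concrete with a toy model (made-up unit costs, chosen only to show the trend): with comparison work growing as n log n and run-writing as n, the I/O share of total cost shrinks roughly as 1 / (1 + log2 n) as input grows.

```c
#include <assert.h>

/* Integer-ish log2 by repeated halving; precision doesn't matter,
 * only the trend does. */
static double
approx_log2(double n)
{
    double r = 0.0;

    while (n > 1.0)
    {
        n /= 2.0;
        r += 1.0;
    }
    return r;
}

/* Fraction of total cost spent writing runs, under a toy model where
 * sorting costs n*log2(n) units and writing runs costs n units. */
static double
io_fraction(double n)
{
    return n / (n + n * approx_log2(n));
}
```

For 10^3 tuples the model puts roughly 9% of the cost in run-writing; for 10^9 tuples it drops to about 3%, which is the "more CPU bound" claim above.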
Re: [HACKERS] Parallel tuplesort (for parallel B-Tree index creation)
On Sat, Feb 4, 2017 at 11:58 AM, Peter Geoghegan wrote:
> On Fri, Feb 3, 2017 at 5:04 AM, Thomas Munro wrote:
>> 1. 'asc' = pre-sorted data (w = 0 shows time in seconds, other columns
>> show speed-up relative to that time):
>>
>>  mem | w = 0  | w = 1 | w = 2 | w = 3 | w = 4 | w = 5 | w = 6 | w = 7 | w = 8
>> -----+--------+-------+-------+-------+-------+-------+-------+-------+-------
>>    1 | 119.97 | 4.61x | 4.83x | 5.32x | 5.61x | 5.88x | 6.10x | 6.18x | 6.09x
>>   64 |  19.42 | 1.18x | 1.10x | 1.23x | 1.23x | 1.16x | 1.19x | 1.20x | 1.21x
>>  256 |  18.35 | 1.02x | 0.92x | 0.98x | 1.02x | 1.06x | 1.07x | 1.08x | 1.10x
>>  512 |  17.75 | 1.01x | 0.89x | 0.95x | 0.99x | 1.02x | 1.05x | 1.06x | 1.07x
>
> I think that this presorted case doesn't improve much because the
> sorting itself is so cheap, as explained in my last mail. However, the
> improvements as workers are added are still smaller than expected. I
> think that this indicates that there isn't enough I/O capacity
> available here to truly show the full potential of the patch -- I've
> certainly seen better scalability for cases like this when there is a
> lot of I/O bandwidth available, and I/O parallelism is there to be
> taken advantage of. Say, when using a system with a large RAID array
> (I used a RAID0 array with 12 HDDs for my own tests). Another issue is
> that you probably don't have enough data here to really show off the
> patch. I don't want to dismiss the benchmark, which is still quite
> informative, but it's worth pointing out that the feature is going to
> be most compelling for very large indexes, that will take at least
> several minutes to build under any circumstances. (Having a
> reproducible case is also important, which is something your test has
> going for it, on the other hand.)

Right. My main reason for starting smallish was to allow me to search a space with several variables without waiting eons.
Next I would like to run a small subset of those tests with, say, 10, 20 or even 100 times more data loaded, so the tables would be ~20GB, ~40GB or ~200GB.

About read bandwidth: It shouldn't have been touching the disk at all for reads: I did a dummy run of the index build before the measured runs, so that a 2GB table being sorted in ~2 minutes would certainly have come entirely from the OS page cache since the machine has oodles of RAM.

About write bandwidth: The WAL, the index and the temp files all went to an SSD array, though I don't have the characteristics of that to hand. I should also be able to test on a multi-spindle HDD array. I doubt either can touch your 12-way RAID0 array, but will look into that.

> I suspect that this system isn't particularly well balanced for the
> task of benchmarking the patch. You would probably see notably better
> scalability than any you've shown in any test if you could add
> additional sequential I/O bandwidth, which is probably an economical,
> practical choice for many users. I suspect that you aren't actually
> saturating available CPUs to the greatest extent that the
> implementation makes possible.

I will look into what IO options I can access before running larger tests. Also I will look into running the test with both cold and warm caches (ie "echo 1 > /proc/sys/vm/drop_caches") so that read bandwidth enters the picture.

> Another thing I want to point out is that with 1MB of
> maintenance_work_mem, the patch appears to do very well, but that
> isn't terribly meaningful. I would suggest that we avoid testing this
> patch with such a low amount of memory -- it doesn't seem important.
> This is skewed by the fact that you're using replacement selection in
> the serial case only. I think what this actually demonstrates is that
> replacement selection is very slow, even with its putative best case.
> I believe that commit 2459833 was the final nail in the coffin of
> replacement selection.
> I certainly don't want to relitigate the discussion on
> replacement_sort_tuples, and am not going to push too hard, but ISTM
> that we should fully remove replacement selection from tuplesort.c
> and be done with it.

Interesting. I haven't grokked this but will go and read about it.

Based on your earlier comments about banana skin effects, I'm wondering if it would be interesting to add a couple more heap distributions to the test set that are almost completely sorted except for a few entries out of order.

-- 
Thomas Munro
http://www.enterprisedb.com
Re: [HACKERS] Parallel tuplesort (for parallel B-Tree index creation)
On Fri, Feb 3, 2017 at 5:04 AM, Thomas Munro wrote:
> 1. 'asc' = pre-sorted data (w = 0 shows time in seconds, other columns
> show speed-up relative to that time):
>
>  mem | w = 0  | w = 1 | w = 2 | w = 3 | w = 4 | w = 5 | w = 6 | w = 7 | w = 8
> -----+--------+-------+-------+-------+-------+-------+-------+-------+-------
>    1 | 119.97 | 4.61x | 4.83x | 5.32x | 5.61x | 5.88x | 6.10x | 6.18x | 6.09x
>   64 |  19.42 | 1.18x | 1.10x | 1.23x | 1.23x | 1.16x | 1.19x | 1.20x | 1.21x
>  256 |  18.35 | 1.02x | 0.92x | 0.98x | 1.02x | 1.06x | 1.07x | 1.08x | 1.10x
>  512 |  17.75 | 1.01x | 0.89x | 0.95x | 0.99x | 1.02x | 1.05x | 1.06x | 1.07x

I think that this presorted case doesn't improve much because the sorting itself is so cheap, as explained in my last mail. However, the improvements as workers are added are still smaller than expected. I think that this indicates that there isn't enough I/O capacity available here to truly show the full potential of the patch -- I've certainly seen better scalability for cases like this when there is a lot of I/O bandwidth available, and I/O parallelism is there to be taken advantage of. Say, when using a system with a large RAID array (I used a RAID0 array with 12 HDDs for my own tests).

Another issue is that you probably don't have enough data here to really show off the patch. I don't want to dismiss the benchmark, which is still quite informative, but it's worth pointing out that the feature is going to be most compelling for very large indexes, that will take at least several minutes to build under any circumstances. (Having a reproducible case is also important, which is something your test has going for it, on the other hand.)

I suspect that this system isn't particularly well balanced for the task of benchmarking the patch. You would probably see notably better scalability than any you've shown in any test if you could add additional sequential I/O bandwidth, which is probably an economical, practical choice for many users.
I suspect that you aren't actually saturating available CPUs to the greatest extent that the implementation makes possible.

Another thing I want to point out is that with 1MB of maintenance_work_mem, the patch appears to do very well, but that isn't terribly meaningful. I would suggest that we avoid testing this patch with such a low amount of memory -- it doesn't seem important. This is skewed by the fact that you're using replacement selection in the serial case only. I think what this actually demonstrates is that replacement selection is very slow, even with its putative best case. I believe that commit 2459833 was the final nail in the coffin of replacement selection. I certainly don't want to relitigate the discussion on replacement_sort_tuples, and am not going to push too hard, but ISTM that we should fully remove replacement selection from tuplesort.c and be done with it.

-- 
Peter Geoghegan
Re: [HACKERS] Parallel tuplesort (for parallel B-Tree index creation)
On Fri, Feb 3, 2017 at 5:04 AM, Thomas Munro wrote: > I applied your v2 patch on top of > 7ac4a389a7dbddaa8b19deb228f0a988e79c5795^ to avoid a conflict. It > still had a couple of harmless conflicts that I was able to deal with > (not code, just some header stuff moving around). You must mean my V7 patch. FWIW, I've resolved the conflicts with 7ac4a389a7dbddaa8b19deb228f0a988e79c5795 in my own private branch, and have worked through some of the open items that you raised. > See full results from all permutations attached, but I wanted to > highlight the measurements from 'textwide', 'u', 'nonu' which show > interesting 'asc' numbers (data already sorted). The 'mem' column is > maintenance_work_mem in megabytes. The 'w = 0' column shows the time > in seconds for parallel_workers = 0. The other 'w = N' columns show > times with higher parallel_workers settings, represented as speed-up > relative to the 'w = 0' time. The thing to keep in mind about testing presorted cases in tuplesort in general is that we have this weird precheck for presorted input in our qsort. This is something added by us to the original Bentley & McIlroy algorithm in 2006. I am very skeptical of this addition, in general. It tends to have the effect of highly distorting how effective most optimizations are for presorted cases, which comes up again and again. It only works when the input is *perfectly* presorted, but can throw away an enormous amount of work when the last tuple of input is out of order -- that will throw away all work before that point (not so bad when you think your main cost is comparisons rather than memory accesses, but that isn't the case). Your baseline case can either be made unrealistically fast due to the fact that you get a perfectly sympathetic case for this optimization, or unrealistically slow (very CPU bound) due to the fact that you have that one last tuple out of place. I once said that this last tuple can act like a discarded banana skin. 
There is nothing wrong with the idea of exploiting presortedness, and to some extent the original algorithm does that (by using insertion sort), but an optimization along the lines of Timsort's "galloping mode" (which is what this modification of ours attempts) requires non-trivial bookkeeping to do right.

-- 
Peter Geoghegan
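A sketch of the kind of presorted precheck being discussed (an assumed shape for illustration, not PostgreSQL's actual qsort code): a single linear scan that succeeds only on perfectly sorted input, so one out-of-place element near the end wastes nearly the whole scan.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Return true only if the array is *perfectly* sorted; the caller
 * would then skip sorting entirely.  With input sorted except for the
 * final element -- the "banana skin" case -- the scan runs almost to
 * the end before failing, and all of that work is thrown away. */
static bool
presorted(const int *a, size_t n)
{
    size_t i;

    for (i = 1; i < n; i++)
        if (a[i - 1] > a[i])
            return false;
    return true;
}
```

This is why benchmarks of presorted input can be unrealistically fast (the precheck fires) or unrealistically slow (one trailing element defeats it), as described above.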
Re: [HACKERS] Parallel tuplesort (for parallel B-Tree index creation)
On Wed, Feb 1, 2017 at 8:46 PM, Peter Geoghegan wrote:
> On Tue, Jan 31, 2017 at 11:23 PM, Thomas Munro wrote:
>> So I'm in favour of this patch, which is relatively simple and gives us
>> faster index builds soon. Eventually we might also be able to have
>> approach 1. From what I gather, it's entirely possible that we might
>> still need 2 to fall back on in some cases.
>
> Right. And it can form the basis of an implementation of 1, which in
> any case seems to be much more compelling for parallel query, when a
> great deal more can be pushed down, and we are not particularly likely
> to be I/O bound (usually not much writing to the heap, or WAL
> logging).

I ran some tests today. First I created test tables representing the permutations of these choices:

Table structure:
  int      = Integer key only
  intwide  = Integer key + wide row
  text     = Text key only (using dictionary words)
  textwide = Text key + wide row

Uniqueness:
  u = each value unique
  d = 10 duplicates of each value

Heap physical order:
  rand = Random
  asc  = Ascending order (already sorted)
  desc = Descending order (sorted backwards)

I used 10 million rows for this test run, so that gave me 24 tables of the following sizes as reported in "\d+":

  int tables      = 346MB each
  intwide tables  = 1817MB each
  text tables     = 441MB each
  textwide tables = 1953MB each

It'd be interesting to test larger tables of course but I had a lot of permutations to get through. For each of those tables I ran tests corresponding to the permutations of these three variables:

Index type:
  uniq = CREATE UNIQUE INDEX ("u" tables only, ie no duplicates)
  nonu = CREATE INDEX ("u" and "d" tables)

Maintenance memory: 1MB, 64MB, 256MB, 512MB

Workers: from 0 up to 8

Environment: EDB test machine "cthulhu", Intel(R) Xeon(R) CPU E7-8830 @ 2.13GHz, 8 socket, 8 cores (16 threads) per socket, CentOS 7.2, Linux kernel 3.10.0-229.7.2.el7.x86_64, 512GB RAM, pgdata on SSD.
Database initialised with en_US.utf-8 collation, all defaults except max_wal_size increased to 4GB (otherwise warnings about too frequent checkpoints) and max_parallel_workers_maintenance = 8. Testing done with warm OS cache.

I applied your v2 patch on top of 7ac4a389a7dbddaa8b19deb228f0a988e79c5795^ to avoid a conflict. It still had a couple of harmless conflicts that I was able to deal with (not code, just some header stuff moving around).

See full results from all permutations attached, but I wanted to highlight the measurements from 'textwide', 'u', 'nonu' which show interesting 'asc' numbers (data already sorted). The 'mem' column is maintenance_work_mem in megabytes. The 'w = 0' column shows the time in seconds for parallel_workers = 0. The other 'w = N' columns show times with higher parallel_workers settings, represented as speed-up relative to the 'w = 0' time.

1. 'asc' = pre-sorted data (w = 0 shows time in seconds, other columns show speed-up relative to that time):

 mem | w = 0  | w = 1 | w = 2 | w = 3 | w = 4 | w = 5 | w = 6 | w = 7 | w = 8
-----+--------+-------+-------+-------+-------+-------+-------+-------+-------
   1 | 119.97 | 4.61x | 4.83x | 5.32x | 5.61x | 5.88x | 6.10x | 6.18x | 6.09x
  64 |  19.42 | 1.18x | 1.10x | 1.23x | 1.23x | 1.16x | 1.19x | 1.20x | 1.21x
 256 |  18.35 | 1.02x | 0.92x | 0.98x | 1.02x | 1.06x | 1.07x | 1.08x | 1.10x
 512 |  17.75 | 1.01x | 0.89x | 0.95x | 0.99x | 1.02x | 1.05x | 1.06x | 1.07x

2. 'rand' = randomised data:

 mem | w = 0  | w = 1 | w = 2 | w = 3 | w = 4 | w = 5 | w = 6 | w = 7 | w = 8
-----+--------+-------+-------+-------+-------+-------+-------+-------+-------
   1 | 130.25 | 1.82x | 2.19x | 2.52x | 2.58x | 2.72x | 2.72x | 2.83x | 2.89x
  64 | 117.36 | 1.80x | 2.20x | 2.43x | 2.47x | 2.55x | 2.51x | 2.59x | 2.69x
 256 | 124.68 | 1.87x | 2.20x | 2.49x | 2.52x | 2.64x | 2.70x | 2.72x | 2.75x
 512 | 115.77 | 1.51x | 1.72x | 2.14x | 2.08x | 2.19x | 2.31x | 2.44x | 2.48x

3. 'desc' = reverse-sorted data:

 mem | w = 0  | w = 1 | w = 2 | w = 3 | w = 4 | w = 5 | w = 6 | w = 7 | w = 8
-----+--------+-------+-------+-------+-------+-------+-------+-------+-------
   1 | 115.19 | 1.88x | 2.39x | 2.78x | 3.50x | 3.62x | 4.20x | 4.19x | 4.39x
  64 | 112.17 | 1.85x | 2.25x | 2.99x | 3.63x | 3.65x | 4.01x | 4.31x | 4.62x
 256 | 119.55 | 1.76x | 2.21x | 2.85x | 3.43x | 3.37x | 3.77x | 4.24x | 4.28x
 512 | 119.50 | 1.85x | 2.19x | 2.87x | 3.26x | 3.28x | 3.74x | 4.24x | 3.93x

The 'asc' effects are much less pronounced when the key is an int. Here is the equivalent data for 'intwide', 'u', 'nonu':

1. 'asc'

 mem | w = 0 | w = 1 | w = 2 | w = 3 | w = 4 | w = 5 | w = 6 | w = 7 | w = 8
-----+-------+-------+-------+-------+-------+-------+-------+-------+-------
   1 | 12.19 | 1.55x | 1.93x | 2.21x | 2.44x | 2.64x | 2.76x | 2.91x | 2.83x
  64 |  7.35 | 1.29x | 1.53x | 1.69x | 1.86x | 1.98x | 2.04x | 2.07x | 2.09x
 256 |  7.34 | 1.26x | 1.47x | 1.64x | 1.79x | 1.92x | 1.96x | 1.98x | 2.02x
 512 |  7.24 | 1.24x | 1.46x | 1.65x | 1.80x
Re: [HACKERS] Parallel tuplesort (for parallel B-Tree index creation)
On Tue, Jan 31, 2017 at 11:23 PM, Thomas Munro wrote:
> 2. All participants: parallel sequential scan, sort, spool to disk;
> barrier; leader: merge spooled tuples and build btree.
>
> This patch is doing the 2nd thing. My understanding is that some
> systems might choose to do that if they don't have or don't like the
> table's statistics, since repartitioning for balanced load requires
> carefully chosen ranges and is highly sensitive to distribution
> problems.

The second thing here seems to offer comparable scalability to other systems' implementations of the first thing. They seem to have reused "partitioning to sort in parallel" for B-Tree builds, at least in some cases, despite this. WAL logging is the biggest serial bottleneck here for other systems, I've heard -- that's still going to be pretty much serial. I think that the fact that some systems do partitioning for parallel B-Tree builds might have as much to do with their ability to create B-Tree indexes in place as anything else. Apparently, some systems don't use temp files, instead writing out what is for all intents and purposes part of a finished B-Tree as runs (no use of temp_tablespaces). That may be a big part of what makes it worthwhile to try to use partitioning. I understand that only the highest client counts will see much direct performance benefit relative to the first approach.

> It's pretty clear that approach 1 is a difficult project. From my
> research into dynamic repartitioning in the context of hash joins, I
> can see that that infrastructure is a significant project in its own
> right: subproblems include super efficient tuple exchange, buffering,
> statistics/planning and dealing with/adapting to bad outcomes. I also
> suspect that repartitioning operators might need to be specialised for
> different purposes like sorting vs hash joins, which may have
> differing goals.
> I think it's probably easy to build a slow dynamic
> repartitioning mechanism that frequently results in terrible worst
> case scenarios where you paid a fortune in IPC overheads and still
> finished up with one worker pulling most of the whole load. Without
> range partitioning, I don't believe you can merge the resulting
> non-disjoint btrees efficiently so you'd probably finish up writing a
> complete new btree to mash them together. As for merging disjoint
> btrees, I assume there are ways to do a structure-preserving merge
> that just rebuilds some internal pages and incorporates the existing
> leaf pages directly, a bit like tree manipulation in functional
> programming languages; that'll take some doing.

I agree with all that. "Stitching together" disjoint B-Trees does seem to have some particular risks, which users of other systems are cautioned against in their documentation. You can end up with an unbalanced B-Tree.

> So I'm in favour of this patch, which is relatively simple and gives us
> faster index builds soon. Eventually we might also be able to have
> approach 1. From what I gather, it's entirely possible that we might
> still need 2 to fall back on in some cases.

Right. And it can form the basis of an implementation of 1, which in any case seems to be much more compelling for parallel query, when a great deal more can be pushed down, and we are not particularly likely to be I/O bound (usually not much writing to the heap, or WAL logging).

> Will you move the BufFile changes to a separate patch in the next revision?

That is the plan. I need to get set up with a new machine here, having given back my work laptop to Heroku, but it shouldn't take too long. Thanks for the review.

-- 
Peter Geoghegan
Re: [HACKERS] Parallel tuplesort (for parallel B-Tree index creation)
On Wed, Feb 1, 2017 at 5:37 PM, Michael Paquier wrote:
> On Tue, Jan 31, 2017 at 2:15 PM, Peter Geoghegan wrote:
>> On Mon, Jan 30, 2017 at 8:46 PM, Thomas Munro wrote:
>>> On Wed, Jan 4, 2017 at 12:53 PM, Peter Geoghegan wrote:
>>>> Attached is V7 of the patch.
>>>
>>> I am doing some testing. First, some superficial things from first pass:
>>>
>>> [Various minor cosmetic issues]
>>
>> Oops.
>
> As this review is very recent, I have moved the patch to CF 2017-03.

 ParallelContext *
-CreateParallelContext(parallel_worker_main_type entrypoint, int nworkers)
+CreateParallelContext(parallel_worker_main_type entrypoint, int nworkers,
+                      bool serializable_okay)
 {
     MemoryContext oldcontext;
     ParallelContext *pcxt;
@@ -143,7 +144,7 @@ CreateParallelContext(parallel_worker_main_type entrypoint, int nworkers)
      * workers, at least not until somebody enhances that mechanism to be
      * parallel-aware.
      */
-    if (IsolationIsSerializable())
+    if (IsolationIsSerializable() && !serializable_okay)
         nworkers = 0;

That's a bit weird but I can't think of a problem with it. Workers run with MySerializableXact == InvalidSerializableXact, even though they may have the snapshot of a SERIALIZABLE leader. Hopefully soon the restriction on SERIALIZABLE in parallel queries can be lifted anyway, and then this could be removed.

Here are some thoughts on the overall approach. Disclaimer: I haven't researched the state of the art in parallel sort or btree builds. But I gather from general reading that there are a couple of well known approaches, and I'm sure you'll correct me if I'm off base here.

1. All participants: parallel sequential scan, repartition on the fly so each worker has tuples in a non-overlapping range, sort, build disjoint btrees; barrier; leader: merge disjoint btrees into one.

2. All participants: parallel sequential scan, sort, spool to disk; barrier; leader: merge spooled tuples and build btree.

This patch is doing the 2nd thing.
My understanding is that some systems might choose to do that if they don't have or don't like the table's statistics, since repartitioning for balanced load requires carefully chosen ranges and is highly sensitive to distribution problems.

It's pretty clear that approach 1 is a difficult project. From my research into dynamic repartitioning in the context of hash joins, I can see that that infrastructure is a significant project in its own right: subproblems include super efficient tuple exchange, buffering, statistics/planning and dealing with/adapting to bad outcomes. I also suspect that repartitioning operators might need to be specialised for different purposes like sorting vs hash joins, which may have differing goals. I think it's probably easy to build a slow dynamic repartitioning mechanism that frequently results in terrible worst case scenarios where you paid a fortune in IPC overheads and still finished up with one worker pulling most of the whole load. Without range partitioning, I don't believe you can merge the resulting non-disjoint btrees efficiently so you'd probably finish up writing a complete new btree to mash them together. As for merging disjoint btrees, I assume there are ways to do a structure-preserving merge that just rebuilds some internal pages and incorporates the existing leaf pages directly, a bit like tree manipulation in functional programming languages; that'll take some doing.

So I'm in favour of this patch, which is relatively simple and gives us faster index builds soon. Eventually we might also be able to have approach 1. From what I gather, it's entirely possible that we might still need 2 to fall back on in some cases.

Will you move the BufFile changes to a separate patch in the next revision?

Still testing and reviewing, more soon.
-- 
Thomas Munro
http://www.enterprisedb.com
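The "2nd thing" described above (workers sort and spool, leader merges) is easy to model in miniature. The sketch below is illustrative only: the names are invented, and the real tuplesort.c leader streams tuples from worker tapes through a binary heap rather than scanning in-memory arrays.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Toy model of "approach 2": each worker has already produced one fully
 * sorted run; the leader repeatedly takes the smallest head element.
 * All names here are hypothetical, not the patch's.
 */
typedef struct SortedRun
{
    const int  *vals;       /* worker's sorted output */
    size_t      len;
    size_t      pos;        /* next unread element */
} SortedRun;

/* Merge nruns sorted runs into out[]; returns the element count written. */
static size_t
merge_runs(SortedRun *runs, int nruns, int *out)
{
    size_t      nout = 0;

    for (;;)
    {
        int         best = -1;

        for (int i = 0; i < nruns; i++)
        {
            if (runs[i].pos >= runs[i].len)
                continue;       /* this run is exhausted */
            if (best < 0 ||
                runs[i].vals[runs[i].pos] < runs[best].vals[runs[best].pos])
                best = i;
        }
        if (best < 0)
            break;              /* all runs exhausted */
        out[nout++] = runs[best].vals[runs[best].pos++];
    }
    return nout;
}
```

A real implementation would replace the linear minimum scan with a heap, but the control flow, one pass that drains all runs in sorted order, is the same.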
Re: [HACKERS] Parallel tuplesort (for parallel B-Tree index creation)
On Tue, Jan 31, 2017 at 2:15 PM, Peter Geoghegan wrote: > On Mon, Jan 30, 2017 at 8:46 PM, Thomas Munro > wrote: >> On Wed, Jan 4, 2017 at 12:53 PM, Peter Geoghegan wrote: >>> Attached is V7 of the patch. >> >> I am doing some testing. First, some superficial things from first pass: >> >> [Various minor cosmetic issues] > > Oops. As this review is very recent, I have moved the patch to CF 2017-03. -- Michael -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Parallel tuplesort (for parallel B-Tree index creation)
On Tue, Jan 31, 2017 at 12:15 AM, Peter Geoghegan wrote:
>> Should this 64KB minimum be mentioned in the documentation?
>
> You mean user-visible documentation, and not just tuplesort.h? I don't
> think that that's necessary. That's a ludicrously low amount of memory
> for a worker to be limited to anyway. It will never come up with
> remotely sensible use of the feature.

I agree.

>> + if (!btspool->isunique)
>> + {
>> + shm_toc_estimate_keys(&pcxt->estimator, 2);
>> + }
>>
>> Project style: people always tell me to drop the curlies in cases like
>> that. There are a few more examples in the patch.
>
> I only do this when there is an "else" that must have curly braces,
> too. There are plenty of examples of this from existing code, so I
> think it's fine.

But I disagree on this one. I think

if (blah)
    stuff();
else
{
    thing();
    gargle();
}

...is much better than

if (blah)
{
    stuff();
}
else
{
    thing();
    gargle();
}

But if there were a comment on a separate line before the call to stuff(), then I would do it the second way.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Re: [HACKERS] Parallel tuplesort (for parallel B-Tree index creation)
On Mon, Jan 30, 2017 at 8:46 PM, Thomas Munro wrote: > On Wed, Jan 4, 2017 at 12:53 PM, Peter Geoghegan wrote: >> Attached is V7 of the patch. > > I am doing some testing. First, some superficial things from first pass: > > [Various minor cosmetic issues] Oops. > Just an observation: if you ask for a large number of workers, but > only one can be launched, it will be constrained to a small fraction > of maintenance_work_mem, but use only one worker. That's probably OK, > and I don't see how to do anything about it unless you are prepared to > make workers wait for an initial message from the leader to inform > them how many were launched. Actually, the leader-owned worker Tuplesort state will have the appropriate amount, so you'd still need to have 2 participants (1 worker + leader-as-worker). And, sorting is much less sensitive to having a bit less memory than hashing (at least when there isn't dozens of runs to merge in the end, or multiple passes). So, I agree that this isn't worth worrying about for a DDL statement. > Should this 64KB minimum be mentioned in the documentation? You mean user-visible documentation, and not just tuplesort.h? I don't think that that's necessary. That's a ludicrously low amount of memory for a worker to be limited to anyway. It will never come up with remotely sensible use of the feature. > + if (!btspool->isunique) > + { > + shm_toc_estimate_keys(&pcxt->estimator, 2); > + } > > Project style: people always tell me to drop the curlies in cases like > that. There are a few more examples in the patch. I only do this when there is an "else" that must have curly braces, too. There are plenty of examples of this from existing code, so I think it's fine. > + /* Wait for workers */ > + ConditionVariableSleep(&shared->workersFinishedCv, > + WAIT_EVENT_PARALLEL_FINISH); > > I don't think we should reuse WAIT_EVENT_PARALLEL_FINISH in > tuplesort_leader_wait and worker_wait. 
> That belongs to
> WaitForParallelWorkersToFinish, so someone who sees that in
> pg_stat_activity won't know which it is.

Noted.

> IIUC worker_wait() is only being used to keep the worker around so its
> files aren't deleted. Once buffile cleanup is changed to be
> ref-counted (in an on_dsm_detach hook?) then workers might as well
> exit sooner, freeing up a worker slot... do I have that right?

Yes. Or at least I think it's very likely that that will end up happening.

> Incidentally, barrier.c could probably be used for this
> synchronisation instead of these functions. I think
> _bt_begin_parallel would call BarrierInit(&shared->barrier,
> scantuplesortstates) and then after LaunchParallelWorkers() it'd call
> a new interface BarrierDetachN(&shared->barrier, scantuplesortstates -
> pcxt->nworkers_launched) to forget about workers that failed to
> launch. Then you could use BarrierWait where the leader waits for the
> workers to finish, and BarrierDetach where the workers are finished
> and want to exit.

I thought about doing that, actually, but I don't like creating dependencies on some other uncommitted patch, which is a moving target (barrier stuff isn't committed yet). It makes life difficult for reviewers. I put off adopting condition variables until they were committed for the same reason -- it was easy to do without them for a time. I'll probably get around to it before too long, but feel no urgency about it. Barriers will only allow me to make a modest net removal of code, AFAIK.

Thanks

-- 
Peter Geoghegan
Re: [HACKERS] Parallel tuplesort (for parallel B-Tree index creation)
On Wed, Jan 4, 2017 at 12:53 PM, Peter Geoghegan wrote:
> Attached is V7 of the patch.

I am doing some testing. First, some superficial things from first pass:

Still applies with some offsets and one easy-to-fix rejected hunk in nbtree.c (removing some #include directives and a struct definition).

+/* Sort parallel code from state for sort__start probes */
+#define PARALLEL_SORT(state) ((state)->shared == NULL ? 0 : \
+(state)->workerNum >= 0 : 1 : 2)

Typo: ':' instead of '?', --enable-dtrace build fails.

+ the entire utlity command, regardless of the number of

Typo: s/utlity/utility/

+ /* Perform sorting of spool, and possibly a spool2 */
+ sortmem = Max(maintenance_work_mem / btshared->scantuplesortstates, 64);

Just an observation: if you ask for a large number of workers, but only one can be launched, it will be constrained to a small fraction of maintenance_work_mem, but use only one worker. That's probably OK, and I don't see how to do anything about it unless you are prepared to make workers wait for an initial message from the leader to inform them how many were launched.

Should this 64KB minimum be mentioned in the documentation?

+ if (!btspool->isunique)
+ {
+ shm_toc_estimate_keys(&pcxt->estimator, 2);
+ }

Project style: people always tell me to drop the curlies in cases like that. There are a few more examples in the patch.

+ /* Wait for workers */
+ ConditionVariableSleep(&shared->workersFinishedCv,
+ WAIT_EVENT_PARALLEL_FINISH);

I don't think we should reuse WAIT_EVENT_PARALLEL_FINISH in tuplesort_leader_wait and worker_wait. That belongs to WaitForParallelWorkersToFinish, so someone who sees that in pg_stat_activity won't know which it is.

IIUC worker_wait() is only being used to keep the worker around so its files aren't deleted. Once buffile cleanup is changed to be ref-counted (in an on_dsm_detach hook?) then workers might as well exit sooner, freeing up a worker slot... do I have that right?
Incidentally, barrier.c could probably be used for this synchronisation instead of these functions. I think _bt_begin_parallel would call BarrierInit(&shared->barrier, scantuplesortstates) and then after LaunchParallelWorkers() it'd call a new interface BarrierDetachN(&shared->barrier, scantuplesortstates - pcxt->nworkers_launched) to forget about workers that failed to launch. Then you could use BarrierWait where the leader waits for the workers to finish, and BarrierDetach where the workers are finished and want to exit.

+ /* Prepare state to create unified tapeset */
+ leaderTapes = palloc(sizeof(TapeShare) * state->maxTapes);

Missing cast (TapeShare *) here? Project style judging by code I've seen, and avoids gratuitous C++ incompatibility.

+_bt_parallel_shared_estimate(Snapshot snapshot)
...
+tuplesort_estimate_shared(int nWorkers)

Inconsistent naming?

More soon.

-- 
Thomas Munro
http://www.enterprisedb.com
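The barrier protocol suggested above reduces to participant counting: initialize with the planned number of participants, detach the workers that never launched, and release everyone once arrivals account for the rest. Here is a single-threaded sketch of just that bookkeeping, with hypothetical names; the real barrier.c state lives in shared memory and uses a spinlock plus condition variable rather than bare counters.

```c
#include <assert.h>

/*
 * Toy model of barrier bookkeeping.  Only the counting logic is shown;
 * there is no locking and no sleeping.  Names are invented for this
 * sketch and do not match barrier.c's actual API.
 */
typedef struct BarrierSketch
{
    int         participants;   /* processes expected to arrive */
    int         arrived;        /* processes that have arrived so far */
} BarrierSketch;

static void
barrier_init(BarrierSketch *b, int participants)
{
    b->participants = participants;
    b->arrived = 0;
}

/* Forget n participants that failed to launch. */
static void
barrier_detach_n(BarrierSketch *b, int n)
{
    b->participants -= n;
}

/*
 * Register one arrival.  Returns 1 if this arrival completes the phase
 * (a real implementation would wake the sleepers here), else 0 (a real
 * implementation would put the caller to sleep).
 */
static int
barrier_arrive(BarrierSketch *b)
{
    b->arrived++;
    return b->arrived >= b->participants;
}
```

In the CREATE INDEX scenario described in the mail, the leader would init with scantuplesortstates, detach the unlaunched workers, and the last finishing participant's arrival would release the leader.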
Re: [HACKERS] Parallel tuplesort (for parallel B-Tree index creation)
On Tue, Dec 20, 2016 at 5:14 PM, Peter Geoghegan wrote: >> Imagine a data structure that is stored in dynamic shared memory and >> contains space for a filename, a reference count, and a mutex. Let's >> call this thing a SharedTemporaryFile or something like that. It >> offers these APIs: >> >> extern void SharedTemporaryFileInitialize(SharedTemporaryFile *); >> extern void SharedTemporaryFileAttach(SharedTemporary File *, dsm_segment >> *seg); >> extern void SharedTemporaryFileAssign(SharedTemporyFile *, char *pathname); >> extern File SharedTemporaryFileGetFile(SharedTemporaryFile *); > > I'm a little bit tired right now, and I have yet to look at Thomas' > parallel hash join patch in any detail. I'm interested in what you > have to say here, but I think that I need to learn more about its > requirements in order to have an informed opinion. Attached is V7 of the patch. The overall emphasis with this revision is on bringing clarity on how much can be accomplished using generalized infrastructure, explaining the unification mechanism coherently, and related issues. Notable changes --- * Rebased to work with the newly simplified logtape.c representation (the recent removal of "indirect blocks" by Heikki). Heikki's work was something that helped with simplifying the whole unification mechanism, to a significant degree. I think that there was over a 50% reduction in logtape.c lines of code in this revision. * randomAccess cases are now able to reclaim disk space from blocks originally written by workers. This further simplifies logtape.c changes significantly. I don't think that this is important because some future randomAccess caller might otherwise have double the storage overhead for their parallel sort, or even because of the disproportionate performance penalty such a caller would experience; rather, it's important because it removes previous special cases (that were internal to logtape.c). 
For example, aside from the fact that worker tapes within a unified tapeset will often have a non-zero offset, there is no state that actually remembers that this is a unified tapeset, because that isn't needed anymore. And, even though we reclaim blocks from workers, we only have one central chokepoint for applying worker offsets in the leader (that chokepoint is ltsReadFillBuffer()). Routines tasked with things like positional seeking for mark/restore for certain tuplesort clients (which are, in general, poorly tested) now need to have no knowledge of unification while still working just the same. This is a consequence of the fact that ltsWriteBlock() callers (and ltsWriteBlock() itself) never have to think about offsets. I'm pretty happy about that.

* pg_restore now prevents the planner from deciding that parallelism should be used, in order to make restoration behavior more consistent and predictable. Iff a dump being restored happens to have a CREATE INDEX with the new index storage parameter parallel_workers set, then pg_restore will use parallel CREATE INDEX. This is accomplished with a new GUC, enable_parallelddl (since "max_parallel_workers_maintenance = 0" will disable parallel CREATE INDEX across the board, ISTM that a second new GUC is required). I think that this behavior is the right trade-off as far as pg_restore goes, although I still don't feel particularly strongly about it. There is now a concrete proposal on what to do about pg_restore, if nothing else. To recap, the general concern addressed here is that there are typically no ANALYZE stats available for the planner to decide with when pg_restore runs CREATE INDEX, although that isn't always true, which was both surprising and inconsistent.

* Addresses the problem of anonymous record types and their need for "remapping" across parallel workers. I've simply pushed the responsibility onto callers within the tuplesort.h contract; parallel CREATE INDEX callers don't need to care about this, as explained there.
(CLUSTER tuplesorts would also be safe.) * Puts the whole rationale for unification into one large comment above the function BufFileUnify(), and removes traces of the same kind of discussion from everywhere else. I think that buffile.c is the right central place to discuss the unification mechanism, now that logtape.c has been greatly simplified. All the fd.c changes are in routines that are only ever called by buffile.c anyway, and are not too complicated (in general, temp fd.c files are only ever owned transitively, through BufFiles). So, morally, the unification mechanism is something that wholly belongs to buffile.c, since unification is all about temp files, and buffile.h is the interface through which temp files are owned and accessed in general, without exception. Unification remains specialized --- On the one hand, BufFileUnify() now describes the whole idea of unification in detail, in its own general terms, including its performance characteristics, but on the other hand it doesn't pretend to be
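The "central chokepoint for applying worker offsets" described above amounts to translating a block number in the unified space into a worker and a worker-local block number. The following toy translation function uses invented structures; the patch's actual bookkeeping is internal to logtape.c/buffile.c and is not shaped like this.

```c
#include <assert.h>

/*
 * Toy model of unified-tapeset block translation.  Assume each worker's
 * temp file occupies one contiguous span of the unified block space,
 * starting at a precomputed offset.  Hypothetical structures.
 */
typedef struct WorkerSpan
{
    long        offset;         /* first unified block owned by worker */
    long        nblocks;        /* blocks written by that worker */
} WorkerSpan;

/*
 * Resolve a unified block number: returns the owning worker's index and
 * stores the worker-local block number in *localblk, or returns -1 if
 * the block lies outside every worker's span.
 */
static int
resolve_block(const WorkerSpan *spans, int nworkers,
              long unifiedblk, long *localblk)
{
    for (int i = 0; i < nworkers; i++)
    {
        if (unifiedblk >= spans[i].offset &&
            unifiedblk < spans[i].offset + spans[i].nblocks)
        {
            *localblk = unifiedblk - spans[i].offset;
            return i;
        }
    }
    return -1;
}
```

The point of having a single chokepoint is that every other routine can pass around unified block numbers and never perform this translation itself.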
Re: [HACKERS] Parallel tuplesort (for parallel B-Tree index creation)
On Wed, Dec 21, 2016 at 10:21 AM, Peter Geoghegan wrote: > On Wed, Dec 21, 2016 at 6:00 AM, Robert Haas wrote: >> 3. Just live with the waste of space. > > I am loathe to create a special case for the parallel interface too, > but I think it's possible that *no* caller will ever actually need to > live with this restriction at any time in the future. I just realized that you were actually talking about the waste of space in workers here, as opposed to the theoretical waste of space that would occur in the leader should there ever be a parallel randomAccess tuplesort caller. To be clear, I am totally against allowing a waste of logtape.c temp file space in *workers*, because that implies a cost that will most certainly be felt by users all the time. -- Peter Geoghegan -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Parallel tuplesort (for parallel B-Tree index creation)
On Wed, Dec 21, 2016 at 6:00 AM, Robert Haas wrote: > 3. Just live with the waste of space. I am loathe to create a special case for the parallel interface too, but I think it's possible that *no* caller will ever actually need to live with this restriction at any time in the future. I am strongly convinced that adopting tuplesort.c for parallelism should involve partitioning [1]. With that approach, even randomAccess callers will not want to read at random for one big materialized tape, since that's at odds with the whole point of partitioning, which is to remove any dependencies between workers quickly and early, so that as much work as possible is pushed down into workers. If a merge join were performed in a world where we have this kind of partitioning, we definitely wouldn't require one big materialized tape that is accessible within each worker. What are the chances of any real user actually having to live with the waste of space at some point in the future? > Another tangentially-related problem I just realized is that we need > to somehow handle the issues that tqueue.c does when transferring > tuples between backends -- most of the time there's no problem, but if > anonymous record types are involved then tuples require "remapping". > It's probably harder to provoke a failure in the tuplesort case than > with parallel query per se, but it's probably not impossible. Thanks for pointing that out. I'll look into it. BTW, I discovered a bug where there is very low memory available within each worker -- tuplesort.c throws an error within workers immediately. It's just a matter of making sure that they at least have 64KB of workMem, which is a pretty straightforward fix. Obviously it makes no sense to use so little memory in the first place; this is a corner case. 
[1] https://www.postgresql.org/message-id/CAM3SWZR+ATYAzyMT+hm-Bo=1l1smtjbndtibwbtktyqs0dy...@mail.gmail.com

-- 
Peter Geoghegan
Re: [HACKERS] Parallel tuplesort (for parallel B-Tree index creation)
On Wed, Dec 21, 2016 at 7:04 AM, Heikki Linnakangas wrote:
>> If the worker is always completely finished with the tape before the
>> leader touches it, couldn't the leader's LogicalTapeSet just "adopt"
>> the tape and overwrite it like any other?
>
> Currently, the logical tape code assumes that all tapes in a single
> LogicalTapeSet are allocated from the same BufFile. The logical tape's
> on-disk format contains block numbers, to point to the next/prev block of
> the tape [1], and they're assumed to refer to the same file. That allows
> reusing space efficiently during the merge. After you have read the first
> block from tapes A, B and C, you can immediately reuse those three blocks
> for output tape D.

I see. Hmm.

> Now, if you read multiple tapes, from different LogicalTapeSets, hence backed
> by different BufFiles, you cannot reuse the space from those different tapes
> for a single output tape, because the on-disk format doesn't allow referring
> to blocks in other files. You could reuse the space of *one* of the input
> tapes, by placing the output tape in the same LogicalTapeSet, but not all of
> them.
>
> We could enhance that, by using "filename + block number" instead of just
> block number, in the pointers in the logical tapes. Then you could spread
> one logical tape across multiple files. Probably not worth it in practice,
> though.

OK, so the options as I understand them are:

1. Enhance the logical tape set infrastructure in the manner you mention, to support filename (or more likely a proxy for filename) + block number in the logical tape pointers. Then, tapes can be transferred from one LogicalTapeSet to another.

2. Enhance the BufFile infrastructure to support some notion of a shared BufFile so that multiple processes can be reading and writing blocks in the same BufFile. Then, extend the logical tape infrastructure so that we also have the notion of a shared LogicalTape.
This means that things like ltsGetFreeBlock() need to be re-engineered to handle concurrency with other backends. 3. Just live with the waste of space. I would guess that (1) is easier than (2). Also, (2) might provoke contention while writing tapes that is otherwise completely unnecessary. It seems silly to have multiple backends fighting over the same end-of-file pointer for the same file when they could just write to different files instead. Another tangentially-related problem I just realized is that we need to somehow handle the issues that tqueue.c does when transferring tuples between backends -- most of the time there's no problem, but if anonymous record types are involved then tuples require "remapping". It's probably harder to provoke a failure in the tuplesort case than with parallel query per se, but it's probably not impossible. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Parallel tuplesort (for parallel B-Tree index creation)
On 12/21/2016 12:53 AM, Robert Haas wrote:
>> That leaves one problem, though: reusing space in the final merge phase.
>> If the tapes being merged belong to different LogicalTapeSets, and we
>> create one new tape to hold the result, the new tape cannot easily reuse
>> the space of the input tapes because they are on different tape sets.
>
> If the worker is always completely finished with the tape before the
> leader touches it, couldn't the leader's LogicalTapeSet just "adopt"
> the tape and overwrite it like any other?

Currently, the logical tape code assumes that all tapes in a single LogicalTapeSet are allocated from the same BufFile. The logical tape's on-disk format contains block numbers, to point to the next/prev block of the tape [1], and they're assumed to refer to the same file. That allows reusing space efficiently during the merge. After you have read the first block from tapes A, B and C, you can immediately reuse those three blocks for output tape D.

Now, if you read multiple tapes, from different LogicalTapeSets, hence backed by different BufFiles, you cannot reuse the space from those different tapes for a single output tape, because the on-disk format doesn't allow referring to blocks in other files. You could reuse the space of *one* of the input tapes, by placing the output tape in the same LogicalTapeSet, but not all of them.

We could enhance that, by using "filename + block number" instead of just block number, in the pointers in the logical tapes. Then you could spread one logical tape across multiple files. Probably not worth it in practice, though.

[1] As the code stands, there are no next/prev pointers, but a tree of "indirect" blocks. But I'm planning to change that to simpler next/prev pointers, in https://www.postgresql.org/message-id/flat/55b3b7ae-8dec-b188-b8eb-e07604052351%40iki.fi

- Heikki
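Heikki's space-reuse point comes down to the free-block list being shared by every tape backed by the same BufFile: a block released by any input tape can be handed straight to the output tape, and the file is extended only when the list is empty. A toy model with invented structures (logtape.c's real free-list management is considerably more involved):

```c
#include <assert.h>

/*
 * Toy model of per-tapeset free-block recycling.  One freelist covers
 * the whole backing file, so blocks read off tapes A, B and C become
 * immediately available to output tape D.  Hypothetical structures.
 */
#define MAX_FREE 64

typedef struct TapeSetSketch
{
    long        nfileblocks;            /* backing file size, in blocks */
    long        freelist[MAX_FREE];     /* recycled block numbers */
    int         nfree;
} TapeSetSketch;

/* Release a block that some input tape has finished reading. */
static void
release_block(TapeSetSketch *ts, long blk)
{
    ts->freelist[ts->nfree++] = blk;
}

/* Get a block for the output tape: reuse freed space before extending. */
static long
get_free_block(TapeSetSketch *ts)
{
    if (ts->nfree > 0)
        return ts->freelist[--ts->nfree];
    return ts->nfileblocks++;           /* extend the backing file */
}
```

With tapes in *different* BufFiles this recycling is impossible, because a block number alone cannot name a block in another file, which is exactly the limitation the "filename + block number" idea would lift.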
Re: [HACKERS] Parallel tuplesort (for parallel B-Tree index creation)
On Tue, Dec 20, 2016 at 8:14 PM, Peter Geoghegan wrote: > Without meaning to sound glib, unification is the process by which > parallel CREATE INDEX has the leader read temp files from workers > sufficient to complete its final on-the-fly merge. That's not glib, but you can't in the end define BufFile unification in terms of what parallel CREATE INDEX needs. Whatever changes we make to lower-level abstractions in the service of some higher-level goal need to be explainable on their own terms. >>It's fine to make up new words -- indeed, in some sense that is the essence >> of writing any complex problem -- but you have to define them. > > I invite you to help me define this new word. If at some point I'm able to understand what it means, I'll try to do that. I think you're loosely using "unification" to mean combining stuff from different backends in some way that depends on the particular context, so that "BufFile unification" can be different from "LogicalTape unification". But that's just punting the question of what each of those things actually are. > Maybe it's inferior to that, but I think what Heikki proposes is more > or less complementary to what I've proposed, and has nothing to do > with resource management and plenty to do with making the logtape.c > interface look nice, AFAICT. It's also about refactoring/simplifying > logtape.c itself, while we're at it. I believe that Heikki has yet to > comment either way on my approach to resource management, one aspect > of the patch that I was particularly keen on your looking into. My reading of Heikki's point was that there's not much point in touching the BufFile level of things if we can do all of the necessary stuff at the LogicalTape level, and I agree with him about that. If a shared BufFile had a shared read-write pointer, that would be a good justification for having it. 
But it seems like unification at the BufFile level is just concatenation, and that can be done just as well at the LogicalTape level, so why tinker with BufFile? As I've said, I think there's some low-level hacking needed here to make sure files get removed at the correct time in all cases, but apart from that I see no good reason to push the concatenation operation all the way down into BufFile. > The theory of operation here is that workers own their own BufFiles, > and are responsible for deleting them when they die. The assumption, > rightly or wrongly, is that it's sufficient that workers flush > everything out (write out temp files), and yield control to the > leader, which will open their temp files for the duration of the > leader final on-the-fly merge. The resource manager in the leader > knows it isn't supposed to ever delete worker-owned files (just > close() the FDs), and the leader errors if it cannot find temp files > that match what it expects. If there is a an error in the leader, it > shuts down workers, and they clean up, more than likely. If there is > an error in the worker, or if the files cannot be deleted (e.g., if > there is a classic hard crash scenario), we should also be okay, > because nobody will trip up on some old temp file from some worker, > since fd.c has some gumption about what workers need to do (and what > the leader needs to avoid) in the event of a hard crash. I don't see a > risk of file descriptor leaks, which may or may not have been part of > your concern (please clarify). I don't think there's any new issue with file descriptor leaks here, but I think there is a risk of calling unlink() too early or too late with your design. My proposal was an effort to nail that down real tight. >> If the worker is always completely finished with the tape before the >> leader touches it, couldn't the leader's LogicalTapeSet just "adopt" >> the tape and overwrite it like any other? 
> > I'll remind you that parallel CREATE INDEX doesn't actually ever need > to be randomAccess, and so we are not actually going to ever need to > do this as things stand. I wrote the code that way in order to not > break the existing interface, which seemed like a blocker to posting > the patch. I am open to the idea of such an "adoption" occurring, even > though it actually wouldn't help any case that exists in the patch as > proposed. I didn't go that far in part because it seemed premature, > given that nobody had looked at my work to date at the time, and given > the fact that there'd be no initial user-visible benefit, and given > how the exact meaning of "unification" was (and is) somewhat in flux. > > I see no good reason to not do that, although that might change if I > actually seriously undertook to teach the leader about this kind of > "adoption". I suspect that the interface specification would make for > confusing reading, which isn't terribly appealing, but I'm sure I > could manage to make it work given time. I think the interface is pretty clear: the worker's logical tapes get incorporated into the leader's LogicalTapeSet as if they'd bee
Re: [HACKERS] Parallel tuplesort (for parallel B-Tree index creation)
On Tue, Dec 20, 2016 at 2:53 PM, Robert Haas wrote: >> What I have in mind is something like the attached patch. It refactors >> LogicalTapeRead(), LogicalTapeWrite() etc. functions to take a LogicalTape >> as argument, instead of LogicalTapeSet and tape number. LogicalTapeSet >> doesn't have the concept of a tape number anymore, it can contain any number >> of tapes, and you can create more on the fly. With that, it'd be fairly easy >> to make tuplesort.c merge LogicalTapes that came from different tape sets, >> backed by different BufFiles. I think that'd avoid much of the unification >> code. > > I just looked at the buffile.c/buffile.h changes in the latest version > of the patch and I agree with this criticism, though maybe not with > the proposed solution. I actually don't understand what "unification" > is supposed to mean. The patch really doesn't explain that anywhere > that I can see. It says stuff like: > > + * Parallel operations can use an interface to unify multiple worker-owned > + * BufFiles and a leader-owned BufFile within a leader process. This relies > + * on various fd.c conventions about the naming of temporary files. Without meaning to sound glib, unification is the process by which parallel CREATE INDEX has the leader read temp files from workers sufficient to complete its final on-the-fly merge. So, it's a terminology that's bit like "speculative insertion" was up until UPSERT was committed: a concept that is somewhat in flux, and describes a new low-level mechanism built to support a higher level operation, which must accord with a higher level set of requirements (so, for speculative insertion, that would be avoiding "unprincipled deadlocks" and so on). That being the case, maybe "unification" isn't useful as an precise piece of terminology at this point, but that will change. 
While I'm fairly confident that I basically have the right idea with this patch, I think that you are better at judging the ins and outs of resource management than I am, not least because of the experience of working on parallel query itself. Also, I'm signed up to review parallel hash join in large part because I think there might be some convergence concerning the sharing of BufFiles among parallel workers. I don't think I'm qualified to judge what a general abstraction like this should look like, but I'm trying to get there. > That comment tells you that unification is a thing you can do -- via > an unspecified interface for unspecified reasons using unspecified > conventions -- but it doesn't tell you what the semantics of it are > supposed to be. For example, if we "unify" several BufFiles, do they > then have a shared seek pointer? No. > Do the existing contents effectively > get concatenated in an unpredictable order, or are they all expected > to be empty at the time unification happens? Or something else? The order is the same order as ordinal identifiers are assigned to workers within tuplesort.c, which is undefined, with the notable exception of the leader's own identifier (-1) and area of the unified BufFile space (this is only relevant in randomAccess cases, where leader may write stuff out to its own reserved part of the BufFile space). It only matters that the bit of metadata in shared memory is in that same order, which it clearly will be. So, it's unpredictable, but in the same way that ordinal identifiers are assigned in a not-well-defined order; it doesn't or at least shouldn't matter. We can imagine a case where it does matter, and we probably should, but that case isn't parallel CREATE INDEX. >It's fine to make up new words -- indeed, in some sense that is the essence > of writing any complex problem -- but you have to define them. I invite you to help me define this new word. 
> It's hard to understand how something like this doesn't leak > resources. Maybe that's been thought about here, but it isn't very > clear to me how it's supposed to work. I agree that it would be useful to centrally document what all this unification stuff is about. Suggestions on where that should live are welcome. > In Heikki's proposal, if > process A is trying to read a file owned by process B, and process B > dies and removes the file before process A gets around to reading it, > we have got trouble, especially on Windows which apparently has low > tolerance for such things. Peter's proposal avoids that - I *think* - > by making the leader responsible for all resource cleanup, but that's > inferior to the design we've used for other sorts of shared resource > cleanup (DSM, DSA, shm_mq, lock groups) where the last process to > detach always takes responsibility. Maybe it's inferior to that, but I think what Heikki proposes is more or less complementary to what I've proposed, and has nothing to do with resource management and plenty to do with making the logtape.c interface look nice, AFAICT. It's also about refactoring/simplifying logtape.c itself, while we're at it. I believe that Heik
Re: [HACKERS] Parallel tuplesort (for parallel B-Tree index creation)
On Wed, Sep 21, 2016 at 12:52 PM, Heikki Linnakangas wrote:
> I find this unification business really complicated. I think it'd be simpler
> to keep the BufFiles and LogicalTapeSets separate, and instead teach
> tuplesort.c how to merge tapes that live on different
> LogicalTapeSets/BufFiles. Or refactor LogicalTapeSet so that a single
> LogicalTapeSet can contain tapes from different underlying BufFiles.
>
> What I have in mind is something like the attached patch. It refactors
> LogicalTapeRead(), LogicalTapeWrite() etc. functions to take a LogicalTape
> as argument, instead of LogicalTapeSet and tape number. LogicalTapeSet
> doesn't have the concept of a tape number anymore, it can contain any number
> of tapes, and you can create more on the fly. With that, it'd be fairly easy
> to make tuplesort.c merge LogicalTapes that came from different tape sets,
> backed by different BufFiles. I think that'd avoid much of the unification
> code.

I just looked at the buffile.c/buffile.h changes in the latest version of the patch and I agree with this criticism, though maybe not with the proposed solution. I actually don't understand what "unification" is supposed to mean. The patch really doesn't explain that anywhere that I can see. It says stuff like:

+ * Parallel operations can use an interface to unify multiple worker-owned
+ * BufFiles and a leader-owned BufFile within a leader process. This relies
+ * on various fd.c conventions about the naming of temporary files.

That comment tells you that unification is a thing you can do -- via an unspecified interface for unspecified reasons using unspecified conventions -- but it doesn't tell you what the semantics of it are supposed to be. For example, if we "unify" several BufFiles, do they then have a shared seek pointer? Do the existing contents effectively get concatenated in an unpredictable order, or are they all expected to be empty at the time unification happens? Or something else?
It's fine to make up new words -- indeed, in some sense that is the essence of writing any complex program -- but you have to define them. As far as I can tell, the idea is that we're somehow magically concatenating the BufFiles into one big super-BufFile, but I'm fuzzy on exactly what's supposed to be going on there. It's hard to understand how something like this doesn't leak resources. Maybe that's been thought about here, but it isn't very clear to me how it's supposed to work.

In Heikki's proposal, if process A is trying to read a file owned by process B, and process B dies and removes the file before process A gets around to reading it, we have got trouble, especially on Windows which apparently has low tolerance for such things. Peter's proposal avoids that - I *think* - by making the leader responsible for all resource cleanup, but that's inferior to the design we've used for other sorts of shared resource cleanup (DSM, DSA, shm_mq, lock groups) where the last process to detach always takes responsibility. That avoids assuming that we're always dealing with a leader-follower situation, it doesn't categorically require the leader to be the one who creates the shared resource, and it doesn't require the leader to be the last process to die.

Imagine a data structure that is stored in dynamic shared memory and contains space for a filename, a reference count, and a mutex. Let's call this thing a SharedTemporaryFile or something like that. It offers these APIs:

extern void SharedTemporaryFileInitialize(SharedTemporaryFile *);
extern void SharedTemporaryFileAttach(SharedTemporaryFile *, dsm_segment *seg);
extern void SharedTemporaryFileAssign(SharedTemporaryFile *, char *pathname);
extern File SharedTemporaryFileGetFile(SharedTemporaryFile *);

After setting aside sizeof(SharedTemporaryFile) bytes in your shared DSM segment, you call SharedTemporaryFileInitialize() to initialize them.
Then, every process that cares about the file does SharedTemporaryFileAttach(), which bumps the reference count and sets an on_dsm_detach hook to decrement the reference count and unlink the file if the reference count thereby reaches 0. One of those processes does SharedTemporaryFileAssign(), which fills in the pathname and clears FD_TEMPORARY. Then, any process that has attached can call SharedTemporaryFileGetFile() to get a File which can then be accessed normally.

So, the pattern for parallel sort would be:

- Leader sets aside space and calls SharedTemporaryFileInitialize() and SharedTemporaryFileAttach().
- The cooperating worker calls SharedTemporaryFileAttach() and then SharedTemporaryFileAssign().
- The leader then calls SharedTemporaryFileGetFile().

Since the leader can attach to the file before the path name is filled in, there's no window where the file is at risk of being leaked. Before SharedTemporaryFileAssign(), the worker is solely responsible for removing the file; after that call, whichever of the leader and the worker exits last will remove the file.
Re: [HACKERS] Parallel tuplesort (for parallel B-Tree index creation)
On Mon, Dec 5, 2016 at 7:44 AM, Peter Geoghegan wrote:
> On Sat, Dec 3, 2016 at 7:23 PM, Tomas Vondra wrote:
> > I do share your concerns about unpredictable behavior - that's
> > particularly worrying for pg_restore, which may be used for time-
> > sensitive use cases (DR, migrations between versions), so unpredictable
> > changes in behavior / duration are unwelcome.
>
> Right.
>
> > But isn't this more a deficiency in pg_restore, than in CREATE INDEX?
> > The issue seems to be that the reltuples value may or may not get
> > updated, so maybe forcing ANALYZE (even very low statistics_target
> > values would do the trick, I think) would be a more appropriate solution?
> > Or maybe it's time to add at least some rudimentary statistics into the
> > dumps (the reltuples field seems like a good candidate).
>
> I think that there are a number of reasonable ways of looking at it. It
> might also be worthwhile to have a minimal ANALYZE performed by CREATE
> INDEX directly, iff there are no preexisting statistics (there is
> definitely going to be something pg_restore-like that we cannot fix --
> some ETL tool, for example). Perhaps, as an additional condition to
> proceeding with such an ANALYZE, it should also only happen when there
> is any chance at all of parallelism being used (but then you get into
> having to establish the relation size reliably in the absence of any
> pg_class.relpages, which isn't very appealing when there are many tiny
> indexes).
>
> In summary, I would really like it if a consensus emerged on how
> parallel CREATE INDEX should handle the ecosystem of tools like
> pg_restore, reindexdb, and so on. Personally, I'm neutral on which
> general approach should be taken. Proposals from other hackers about
> what to do here are particularly welcome.
>
> Moved to next CF with "needs review" status.

Regards,
Hari Babu
Fujitsu Australia
Re: [HACKERS] Parallel tuplesort (for parallel B-Tree index creation)
On Sat, Dec 3, 2016 at 7:23 PM, Tomas Vondra wrote:
> I do share your concerns about unpredictable behavior - that's
> particularly worrying for pg_restore, which may be used for time-
> sensitive use cases (DR, migrations between versions), so unpredictable
> changes in behavior / duration are unwelcome.

Right.

> But isn't this more a deficiency in pg_restore, than in CREATE INDEX?
> The issue seems to be that the reltuples value may or may not get
> updated, so maybe forcing ANALYZE (even very low statistics_target
> values would do the trick, I think) would be a more appropriate solution?
> Or maybe it's time to add at least some rudimentary statistics into the
> dumps (the reltuples field seems like a good candidate).

I think that there are a number of reasonable ways of looking at it. It might also be worthwhile to have a minimal ANALYZE performed by CREATE INDEX directly, iff there are no preexisting statistics (there is definitely going to be something pg_restore-like that we cannot fix -- some ETL tool, for example). Perhaps, as an additional condition to proceeding with such an ANALYZE, it should also only happen when there is any chance at all of parallelism being used (but then you get into having to establish the relation size reliably in the absence of any pg_class.relpages, which isn't very appealing when there are many tiny indexes).

In summary, I would really like it if a consensus emerged on how parallel CREATE INDEX should handle the ecosystem of tools like pg_restore, reindexdb, and so on. Personally, I'm neutral on which general approach should be taken. Proposals from other hackers about what to do here are particularly welcome.

--
Peter Geoghegan

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Parallel tuplesort (for parallel B-Tree index creation)
On Sat, 2016-12-03 at 18:37 -0800, Peter Geoghegan wrote:
> On Sat, Dec 3, 2016 at 5:45 PM, Alvaro Herrera wrote:
> >
> > I don't think a patch must necessarily consider all possible uses that
> > the new feature may have. If we introduce parallel index creation,
> > that's great; if pg_restore doesn't start using it right away, that's
> > okay. You, or somebody else, can still patch it later. The patch is
> > still a step forward.
>
> While I agree, right now pg_restore will tend to use or not use
> parallelism for CREATE INDEX more or less by accident, based on
> whether or not pg_class.reltuples has already been set by something
> else (e.g., an earlier CREATE INDEX against the same table in the
> restoration). That seems unacceptable. I haven't just suppressed the
> use of parallel CREATE INDEX within pg_restore because that would be
> taking a position on something I have a hard time defending any
> particular position on. And so, I am slightly concerned about the
> entire ecosystem of tools that could implicitly use parallel CREATE
> INDEX, with undesirable consequences. Especially pg_restore.
>
> It's not so much a hard question as it is an awkward one. I want to
> handle any possible objection about there being future compatibility
> issues with going one way or the other ("This paints us into a corner
> with..."). And, there is no existing, simple way for pg_restore and
> other tools to disable the use of parallelism due to the cost model
> automatically kicking in, while still allowing the proposed new index
> storage parameter ("parallel_workers") to force the use of
> parallelism, which seems like something that should happen. (I might
> have to add a new GUC like "enable_maintenance_parallelism", since
> "max_parallel_workers_maintenance = 0" disables parallelism no matter
> how it might be invoked).
I do share your concerns about unpredictable behavior - that's particularly worrying for pg_restore, which may be used for time-sensitive use cases (DR, migrations between versions), so unpredictable changes in behavior / duration are unwelcome.

But isn't this more a deficiency in pg_restore, than in CREATE INDEX? The issue seems to be that the reltuples value may or may not get updated, so maybe forcing ANALYZE (even very low statistics_target values would do the trick, I think) would be a more appropriate solution? Or maybe it's time to add at least some rudimentary statistics into the dumps (the reltuples field seems like a good candidate).

Trying to fix this by adding more GUCs seems a bit strange to me.

> In general, I have a positive outlook on this patch, since it appears
> to compete well with similar implementations in other systems
> scalability-wise. It does what it's supposed to do.

+1 to that

--
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Re: [HACKERS] Parallel tuplesort (for parallel B-Tree index creation)
On Sat, Dec 3, 2016 at 5:45 PM, Alvaro Herrera wrote:
> I don't think a patch must necessarily consider all possible uses that
> the new feature may have. If we introduce parallel index creation,
> that's great; if pg_restore doesn't start using it right away, that's
> okay. You, or somebody else, can still patch it later. The patch is
> still a step forward.

While I agree, right now pg_restore will tend to use or not use parallelism for CREATE INDEX more or less by accident, based on whether or not pg_class.reltuples has already been set by something else (e.g., an earlier CREATE INDEX against the same table in the restoration). That seems unacceptable. I haven't just suppressed the use of parallel CREATE INDEX within pg_restore because that would be taking a position on something I have a hard time defending any particular position on. And so, I am slightly concerned about the entire ecosystem of tools that could implicitly use parallel CREATE INDEX, with undesirable consequences. Especially pg_restore.

It's not so much a hard question as it is an awkward one. I want to handle any possible objection about there being future compatibility issues with going one way or the other ("This paints us into a corner with..."). And, there is no existing, simple way for pg_restore and other tools to disable the use of parallelism due to the cost model automatically kicking in, while still allowing the proposed new index storage parameter ("parallel_workers") to force the use of parallelism, which seems like something that should happen. (I might have to add a new GUC like "enable_maintenance_parallelism", since "max_parallel_workers_maintenance = 0" disables parallelism no matter how it might be invoked).

In general, I have a positive outlook on this patch, since it appears to compete well with similar implementations in other systems scalability-wise. It does what it's supposed to do.
--
Peter Geoghegan
Re: [HACKERS] Parallel tuplesort (for parallel B-Tree index creation)
Peter Geoghegan wrote:
> On Mon, Nov 7, 2016 at 8:28 PM, Peter Geoghegan wrote:
> > What do we need to teach pg_restore about parallel CREATE INDEX, if
> > anything at all? Could this be as simple as a blanket disabling of
> > parallelism for CREATE INDEX from pg_restore? Or, does it need to be
> > more sophisticated than that? I suppose that tools like reindexdb and
> > pgbench must be considered in a similar way.
>
> I still haven't resolved this question, which seems like the most
> important outstanding question,

I don't think a patch must necessarily consider all possible uses that the new feature may have. If we introduce parallel index creation, that's great; if pg_restore doesn't start using it right away, that's okay. You, or somebody else, can still patch it later. The patch is still a step forward.

--
Álvaro Herrera                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Re: [HACKERS] Parallel tuplesort (for parallel B-Tree index creation)
On Mon, Nov 7, 2016 at 8:28 PM, Peter Geoghegan wrote:
> What do we need to teach pg_restore about parallel CREATE INDEX, if
> anything at all? Could this be as simple as a blanket disabling of
> parallelism for CREATE INDEX from pg_restore? Or, does it need to be
> more sophisticated than that? I suppose that tools like reindexdb and
> pgbench must be considered in a similar way.

I still haven't resolved this question, which seems like the most important outstanding question, but I attach V6. Changes:

* tuplesort.c was adapted to use the recently committed condition variables stuff. This made things cleaner. No more ad-hoc WaitLatch() looping.

* Adapted docs to mention the newly committed max_parallel_workers GUC in the context of discussing the proposed max_parallel_workers_maintenance GUC.

* Fixed a trivial assertion failure bug that could be tripped when a conventional sort uses very little memory.

--
Peter Geoghegan

0002-Add-temporary-testing-tools.patch.gz
Description: GNU Zip compressed data

0001-Add-parallel-B-tree-index-build-sorting.patch.gz
Description: GNU Zip compressed data
Re: [HACKERS] Parallel tuplesort (for parallel B-Tree index creation)
On Wed, Nov 9, 2016 at 10:18 PM, Peter Geoghegan wrote:
>> Maybe another way of putting this is that, while there's clearly a
>> benefit to having some kind of a cap, it's appropriate to pick a large
>> value, such as 500. Having no cap at all risks creating many extra
>> tapes that just waste memory, and also risks an unduly
>> cache-inefficient final merge. Reining that in makes sense.
>> However, we can't rein it in too far or we'll create slow polyphase
>> merges in cases that are reasonably likely to occur in real life.
>
> I completely agree with your analysis.

Cool.

BTW, my run with 10 tapes completed in 10696528.377 ms (02:58:16.528) - i.e. almost 3 minutes slower than with no tape limit. Building runs took 4260.16 s, and the final merge pass began at 8239.12 s. That's certainly better than I expected, and it seems to show that even if the number of tapes is grossly inadequate for the number of runs, you can still make up most of the time that you lose to I/O with improved cache efficiency -- at least under favorable circumstances. Of course, on many systems I/O bandwidth will be a scarce resource, so that argument can be overdone -- and even if not, a 10-tape sort takes FAR longer to deliver the first tuple.

I also tried this out with work_mem = 512MB. Doubling work_mem reduces the number of runs enough that we don't get a polyphase merge in any case.
With no limit on tapes:

2016-11-10 11:24:45 UTC [54042] LOG: switching to external sort with 1873 tapes: CPU: user: 11.34 s, system: 0.48 s, elapsed: 12.13 s
2016-11-10 12:36:22 UTC [54042] LOG: finished writing run 308 to tape 307: CPU: user: 4096.63 s, system: 156.88 s, elapsed: 4309.66 s
2016-11-10 12:36:22 UTC [54042] LOG: using 516563 KB of memory for read buffers among 308 input tapes
2016-11-10 12:36:30 UTC [54042] LOG: performsort done (except 308-way final merge): CPU: user: 4097.75 s, system: 157.24 s, elapsed: 4317.76 s
2016-11-10 13:54:07 UTC [54042] LOG: external sort ended, 6255577 disk blocks used: CPU: user: 8638.72 s, system: 177.42 s, elapsed: 8974.44 s

With max_sort_tapes = 501:

2016-11-10 14:23:50 UTC [54042] LOG: switching to external sort with 502 tapes: CPU: user: 10.99 s, system: 0.54 s, elapsed: 11.57 s
2016-11-10 15:36:47 UTC [54042] LOG: finished writing run 278 to tape 277: CPU: user: 4190.31 s, system: 155.33 s, elapsed: 4388.86 s
2016-11-10 15:36:47 UTC [54042] LOG: using 517313 KB of memory for read buffers among 278 input tapes
2016-11-10 15:36:54 UTC [54042] LOG: performsort done (except 278-way final merge): CPU: user: 4191.36 s, system: 155.68 s, elapsed: 4395.66 s
2016-11-10 16:53:39 UTC [54042] LOG: external sort ended, 6255699 disk blocks used: CPU: user: 8673.07 s, system: 175.93 s, elapsed: 9000.80 s

0.3% slower with the tape limit, but that might be noise. Even if not, it seems pretty silly to create 1873 tapes when we only need ~300.
At work_mem = 2GB:

2016-11-10 18:08:00 UTC [54042] LOG: switching to external sort with 7490 tapes: CPU: user: 44.28 s, system: 1.99 s, elapsed: 46.33 s
2016-11-10 19:23:06 UTC [54042] LOG: finished writing run 77 to tape 76: CPU: user: 4342.10 s, system: 156.21 s, elapsed: 4551.95 s
2016-11-10 19:23:06 UTC [54042] LOG: using 2095202 KB of memory for read buffers among 77 input tapes
2016-11-10 19:23:12 UTC [54042] LOG: performsort done (except 77-way final merge): CPU: user: 4343.36 s, system: 157.07 s, elapsed: 4558.79 s
2016-11-10 20:24:24 UTC [54042] LOG: external sort ended, 6255946 disk blocks used: CPU: user: 7894.71 s, system: 176.36 s, elapsed: 8230.13 s

At work_mem = 2GB, max_sort_tapes = 501:

2016-11-10 21:28:23 UTC [54042] LOG: switching to external sort with 502 tapes: CPU: user: 44.09 s, system: 1.94 s, elapsed: 46.07 s
2016-11-10 22:42:28 UTC [54042] LOG: finished writing run 68 to tape 67: CPU: user: 4278.49 s, system: 154.39 s, elapsed: 4490.25 s
2016-11-10 22:42:28 UTC [54042] LOG: using 2095427 KB of memory for read buffers among 68 input tapes
2016-11-10 22:42:34 UTC [54042] LOG: performsort done (except 68-way final merge): CPU: user: 4279.60 s, system: 155.21 s, elapsed: 4496.83 s
2016-11-10 23:42:10 UTC [54042] LOG: external sort ended, 6255983 disk blocks used: CPU: user: 7733.98 s, system: 173.85 s, elapsed: 8072.55 s

Roughly 2% faster. Maybe still noise, but less likely. 7490 tapes certainly seems over the top.
At work_mem = 8GB:

2016-11-14 19:17:28 UTC [54042] LOG: switching to external sort with 29960 tapes: CPU: user: 183.80 s, system: 7.71 s, elapsed: 191.61 s
2016-11-14 20:32:02 UTC [54042] LOG: finished writing run 20 to tape 19: CPU: user: 4431.44 s, system: 176.82 s, elapsed: 4665.16 s
2016-11-14 20:32:02 UTC [54042] LOG: using 8388083 KB of memory for read buffers among 20 input tapes
2016-11-14 20:32:26 UTC [54042] LOG: performsort done (except 20-way final merge): CPU: user: 4432.99 s, system: 181.29 s, elapsed: 4689.52 s
2016-11-14 21:30:56 UTC [54042] LOG: external sort ended, 6256
Re: [HACKERS] Parallel tuplesort (for parallel B-Tree index creation)
On Wed, Nov 9, 2016 at 6:57 PM, Robert Haas wrote:
> I guess that's possible, but the problem with polyphase merge is that
> the increased I/O becomes a pretty significant cost in a hurry.

Not if you have a huge RAID array. :-)

Obviously I'm not seriously suggesting that we revise the cap from 500 to 7. We're only concerned about the constant factors here. There is clearly a need to make some simplifying assumptions. I think that you understand this very well, though.

> Maybe another way of putting this is that, while there's clearly a
> benefit to having some kind of a cap, it's appropriate to pick a large
> value, such as 500. Having no cap at all risks creating many extra
> tapes that just waste memory, and also risks an unduly
> cache-inefficient final merge. Reining that in makes sense.
> However, we can't rein it in too far or we'll create slow polyphase
> merges in cases that are reasonably likely to occur in real life.

I completely agree with your analysis.

--
Peter Geoghegan
Re: [HACKERS] Parallel tuplesort (for parallel B-Tree index creation)
On Wed, Nov 9, 2016 at 7:54 PM, Peter Geoghegan wrote: >> Now, on the other hand, as far as I can see, the actual amount of >> evidence that 0001 is a good idea which has been presented in this >> forum is pretty near zero. You've argued for it on theoretical >> grounds several times, but theoretical arguments are not a substitute >> for test results. > > See the illustration in TAOCP, vol III, page 273 in the second edition > -- "Fig. 70. Efficiency of Polyphase merge using Algorithm D". I think > that it's actually a real-world benchmark. I don't have that publication, and I'm guessing that's not based on PostgreSQL's implementation. There's no substitute for tests using the code we've actually got. >> So, I'm now feeling pretty bullish about this patch, except for one >> thing, which is that I think the comments are way off-base. Peter >> writes: $When allowedMem is significantly lower than what is required >> for an internal sort, it is unlikely that there are benefits to >> increasing the number of tapes beyond Knuth's "sweet spot" of 7.$ >> I'm pretty sure that's totally wrong, first of all because commit >> df700e6b40195d28dc764e0c694ac8cef90d4638 improved performance by doing >> precisely the thing which this comment says we shouldn't > > It's more complicated than that. As I said, I think that Knuth > basically had it right with his sweet spot of 7. I think that commit > df700e6b40195d28dc764e0c694ac8cef90d4638 was effective in large part > because a one-pass merge avoided certain overheads not inherent to > polyphase merge, like all that memory accounting stuff, extra palloc() > traffic, etc. The expanded use of per tape buffering we have even in > multi-pass cases likely makes that much less true for us these days. 
> The reason I haven't actually gone right back down to 7 with this cap
> is that it's possible that the added I/O costs outweigh the CPU costs
> in extreme cases, even though I think that polyphase merge doesn't
> have all that much to do with I/O costs, even with its 1970s
> perspective. Knuth doesn't say much about I/O costs -- it's more about
> using an extremely small amount of memory effectively (minimizing CPU
> costs with very little available main memory).
>
> Furthermore, not limiting ourselves to 7 tapes and seeing a benefit
> (benefitting from a few dozen or hundred instead) seems more possible
> with the improved merge heap maintenance logic added recently, where
> there could be perhaps hundreds of runs merged with very low CPU cost
> in the event of presorted input (or, input that is inversely
> logically/physically correlated). That would be true because we'd only
> examine the top of the heap throughout, and so I/O costs may matter
> much more.
>
> Depending on the exact details, I bet you could see a benefit with
> only 7 tapes due to CPU cache efficiency in a case like the one you
> describe. Perhaps when sorting integers, but not when sorting collated
> text. There are many competing considerations, which I've tried my
> best to balance here with a merge order of 500.

I guess that's possible, but the problem with polyphase merge is that the increased I/O becomes a pretty significant cost in a hurry.
Here's the same test with max_sort_tapes = 100:

2016-11-09 23:02:49 UTC [48551] LOG: begin tuple sort: nkeys = 1, workMem = 262144, randomAccess = f
2016-11-09 23:02:55 UTC [48551] LOG: switching to external sort with 101 tapes: CPU: user: 5.72 s, system: 0.25 s, elapsed: 6.04 s
2016-11-10 00:13:00 UTC [48551] LOG: finished writing run 544 to tape 49: CPU: user: 4003.00 s, system: 156.89 s, elapsed: 4211.33 s
2016-11-10 00:16:52 UTC [48551] LOG: finished 51-way merge step: CPU: user: 4214.84 s, system: 161.94 s, elapsed: 4442.98 s
2016-11-10 00:25:41 UTC [48551] LOG: finished 100-way merge step: CPU: user: 4704.14 s, system: 170.83 s, elapsed: 4972.47 s
2016-11-10 00:36:47 UTC [48551] LOG: finished 99-way merge step: CPU: user: 5333.12 s, system: 179.94 s, elapsed: 5638.52 s
2016-11-10 00:45:32 UTC [48551] LOG: finished 99-way merge step: CPU: user: 5821.13 s, system: 189.00 s, elapsed: 6163.53 s
2016-11-10 01:01:29 UTC [48551] LOG: finished 100-way merge step: CPU: user: 6691.10 s, system: 210.60 s, elapsed: 7120.58 s
2016-11-10 01:01:29 UTC [48551] LOG: performsort done (except 100-way final merge): CPU: user: 6691.10 s, system: 210.60 s, elapsed: 7120.58 s
2016-11-10 01:45:40 UTC [48551] LOG: external sort ended, 6255949 disk blocks used: CPU: user: 9271.07 s, system: 232.26 s, elapsed: 9771.49 s

This is already worse than max_sort_tapes = 501, though the total runtime is still better than no cap (the time-to-first-tuple is way worse, though). I'm going to try max_sort_tapes = 10 next, but I think the basic pattern is already fairly clear. As you reduce the cap on the number of tapes, (a) the time to build the initial runs doesn't change very much, (b) the time to perform the final merge decreases significantly, and (c) the time to perform the non
Re: [HACKERS] Parallel tuplesort (for parallel B-Tree index creation)
On Wed, Nov 9, 2016 at 4:54 PM, Peter Geoghegan wrote:
> It's more complicated than that. As I said, I think that Knuth
> basically had it right with his sweet spot of 7. I think that commit
> df700e6b40195d28dc764e0c694ac8cef90d4638 was effective in large part
> because a one-pass merge avoided certain overheads not inherent to
> polyphase merge, like all that memory accounting stuff, extra palloc()
> traffic, etc. The expanded use of per tape buffering we have even in
> multi-pass cases likely makes that much less true for us these days.

Also, logtape.c fragmentation made multiple merge pass cases experience increased random I/O in a way that was only an accident of our implementation. We've fixed that now, but that problem must have added further cost that df700e6b40195d28dc764e0c694ac8cef90d4638 *masked* when it was committed in 2006. (I do think that the problem with the merge heap maintenance fixed recently in 24598337c8d214ba8dcf354130b72c49636bba69 was the biggest problem that the 2006 work masked, though).

--
Peter Geoghegan
Re: [HACKERS] Parallel tuplesort (for parallel B-Tree index creation)
On Wed, Nov 9, 2016 at 4:01 PM, Robert Haas wrote: > I gather that 0001, which puts a cap on the number of tapes, is not > actually related to the subject of this thread; it's an independent > change that you think is a good idea. I reviewed the previous > discussion on this topic upthread, between you and Heikki, which seems > to me to contain more heat than light. FWIW, I don't remember it that way. Heikki seemed to be uncomfortable with the quasi-arbitrary choice of constant, rather than disagreeing with the general idea of a cap. Or, maybe he thought I didn't go far enough, by completely removing polyphase merge. I think that removing polyphase merge would be an orthogonal change to this, though. > Now, on the other hand, as far as I can see, the actual amount of > evidence that 0001 is a good idea which has been presented in this > forum is pretty near zero. You've argued for it on theoretical > grounds several times, but theoretical arguments are not a substitute > for test results. See the illustration in TAOCP, vol III, page 273 in the second edition -- "Fig. 70. Efficiency of Polyphase merge using Algorithm D". I think that it's actually a real-world benchmark. I guess I felt that no one ever argued that as many tapes as possible was sound on any grounds, even theoretical, and so didn't feel obligated to test it until asked to do so. I think that the reason that a cap like this didn't go in around the time that the growth logic went in (2006) was because nobody followed up on it. If you look at the archives, there is plenty of discussion of a cap like this at the time. > That looks very good. On a test that runs for almost 3 hours, we > saved more than 20 minutes. 
The overall runtime improvement is 23% in > a case where we would not expect this patch to do particularly well; > after all, without limiting the number of runs, we are able to > complete the sort with a single merge pass, whereas when we reduce the > number of runs, we now require a polyphase merge. Nevertheless, we > come out way ahead, because the final merge pass gets way faster, > presumably because there are fewer tapes involved. The first test > does a 616-way final merge and takes 6184.34 seconds to do it. The > second test does a 501-way final merge and takes 4519.31 seconds to > do. This increased final merge speed accounts for practically all of > the speedup, and the reason it's faster pretty much has to be that > it's merging fewer tapes. It's CPU cache efficiency -- has to be. > That, in turn, happens for two reasons. First, because limiting the > number of tapes increases slightly the memory available for storing > the tuples belonging to each run, we end up with fewer runs in the > first place. The number of runs drops from from 616 to 577, about a > 7% reduction. Second, because we have more runs than tapes in the > second case, it does a 77-way merge prior to the final merge. Because > of that 77-way merge, the time at which the second run starts > producing tuples is slightly later. Instead of producing the first > tuple at 70:47.71, we have to wait until 75:72.22. That's a small > disadvantage in this case, because it's hypothetically possible that a > query like this could have a LIMIT and we'd end up worse off overall. > However, that's pretty unlikely, for three reasons. Number one, LIMIT > isn't likely to be used on queries of this type in the first place. > Number two, if it were used, we'd probably end up with a bounded sort > plan which would be way faster anyway. Number three, if somehow we > still sorted the data set we'd still win in this case if the limit > were more than about 20% of the total number of tuples. 
The much > faster run time to produce the whole data set is a small price to pay > for possibly needing to wait a little longer for the first tuple. Cool. > So, I'm now feeling pretty bullish about this patch, except for one > thing, which is that I think the comments are way off-base. Peter > writes: $When allowedMem is significantly lower than what is required > for an internal sort, it is unlikely that there are benefits to > increasing the number of tapes beyond Knuth's "sweet spot" of 7.$ > I'm pretty sure that's totally wrong, first of all because commit > df700e6b40195d28dc764e0c694ac8cef90d4638 improved performance by doing > precisely the thing which this comment says we shouldn't. It's more complicated than that. As I said, I think that Knuth basically had it right with his sweet spot of 7. I think that commit df700e6b40195d28dc764e0c694ac8cef90d4638 was effective in large part because a one-pass merge avoided certain overheads not inherent to polyphase merge, like all that memory accounting stuff, extra palloc() traffic, etc. The expanded use of per-tape buffering we have even in multi-pass cases likely makes that much less true for us these days. The reason I haven't actually gone right back down to 7 with this cap is that it's possible that the added I/O cos
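To make the buffer-size effect concrete: merge memory is divided among the input tapes, so capping the tape count directly buys each tape a larger sequential read buffer. A toy calculation (not tuplesort.c's actual accounting, which also reserves memory for per-tape overhead):

```python
def read_buffer_per_tape_kb(avail_kb: int, ntapes: int) -> int:
    """Split a fixed merge-memory budget evenly across input tapes;
    fewer tapes means a larger sequential read buffer per tape."""
    return avail_kb // ntapes
```

Plugging in the figures from Robert's trace_sort output quoted in this thread: 246719 KB over 616 tapes is about 400 KB per tape, while 249594 KB over 501 tapes is about 498 KB per tape, so the capped run reads each tape in larger sequential chunks.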
Re: [HACKERS] Parallel tuplesort (for parallel B-Tree index creation)
On Mon, Nov 7, 2016 at 11:28 PM, Peter Geoghegan wrote: > I attach V5. I gather that 0001, which puts a cap on the number of tapes, is not actually related to the subject of this thread; it's an independent change that you think is a good idea. I reviewed the previous discussion on this topic upthread, between you and Heikki, which seems to me to contain more heat than light. At least in my opinion, the question is not whether a limit on the number of tapes is the best possible system, but rather whether it's better than the status quo. It's silly to refuse to make a simple change on the grounds that some much more complex change might be better, because if somebody writes that patch and it is better we can always revert 0001 then. If 0001 involved hundreds of lines of invasive code changes, that argument wouldn't apply, but it doesn't; it's almost a one-liner. Now, on the other hand, as far as I can see, the actual amount of evidence that 0001 is a good idea which has been presented in this forum is pretty near zero. You've argued for it on theoretical grounds several times, but theoretical arguments are not a substitute for test results. Therefore, I decided that the best thing to do was test it myself. I wrote a little patch to add a GUC for max_sort_tapes, which actually turns out not to work as I thought: setting max_sort_tapes = 501 seems to limit the highest tape number to 501 rather than the number of tapes to 501, so there's a sort of off-by-one error. But that doesn't really matter. The patch is attached here for the convenience of anyone else who may want to fiddle with this. Next, I tried to set things up so that I'd get a large enough number of tapes for the cap to matter. To do that, I initialized with "pgbench -i --unlogged-tables -s 2" so that I had 2 billion tuples. Then I used this SQL query: "select sum(w+abalance) from (select (aid::numeric * 7123000217)%10 w, * from pgbench_accounts order by 1) x". 
The point of the math is to perturb the ordering of the tuples so that they actually need to be sorted instead of just passed through unchanged. The use of abalance in the outer sum prevents an index-only-scan from being used, which makes the sort wider; perhaps I should have tried to make it wider still, but this is what I did. I wanted to have more than 501 tapes because, obviously, a concern with a change like this is that things might get slower in the case where it forces a polyphase merge rather than a single merge pass. And, of course, I set trace_sort = on. Here's what my initial run looked like, in brief: 2016-11-09 15:37:52 UTC [44026] LOG: begin tuple sort: nkeys = 1, workMem = 262144, randomAccess = f 2016-11-09 15:37:59 UTC [44026] LOG: switching to external sort with 937 tapes: CPU: user: 5.51 s, system: 0.27 s, elapsed: 6.56 s 2016-11-09 16:48:31 UTC [44026] LOG: finished writing run 616 to tape 615: CPU: user: 4029.17 s, system: 152.72 s, elapsed: 4238.54 s 2016-11-09 16:48:31 UTC [44026] LOG: using 246719 KB of memory for read buffers among 616 input tapes 2016-11-09 16:48:39 UTC [44026] LOG: performsort done (except 616-way final merge): CPU: user: 4030.30 s, system: 152.98 s, elapsed: 4247.41 s 2016-11-09 18:33:30 UTC [44026] LOG: external sort ended, 6255145 disk blocks used: CPU: user: 10214.64 s, system: 175.24 s, elapsed: 10538.06 s And according to psql: Time: 10538068.225 ms (02:55:38.068) Then I set max_sort_tapes = 501 and ran it again. 
This time: 2016-11-09 19:05:22 UTC [44026] LOG: begin tuple sort: nkeys = 1, workMem = 262144, randomAccess = f 2016-11-09 19:05:28 UTC [44026] LOG: switching to external sort with 502 tapes: CPU: user: 5.69 s, system: 0.26 s, elapsed: 6.13 s 2016-11-09 20:15:20 UTC [44026] LOG: finished writing run 577 to tape 75: CPU: user: 3993.81 s, system: 153.42 s, elapsed: 4198.52 s 2016-11-09 20:15:20 UTC [44026] LOG: using 249594 KB of memory for read buffers among 501 input tapes 2016-11-09 20:21:19 UTC [44026] LOG: finished 77-way merge step: CPU: user: 4329.50 s, system: 160.67 s, elapsed: 4557.22 s 2016-11-09 20:21:19 UTC [44026] LOG: performsort done (except 501-way final merge): CPU: user: 4329.50 s, system: 160.67 s, elapsed: 4557.22 s 2016-11-09 21:38:12 UTC [44026] LOG: external sort ended, 6255484 disk blocks used: CPU: user: 8848.81 s, system: 182.64 s, elapsed: 9170.62 s And this one, according to psql: Time: 9170629.597 ms (02:32:50.630) That looks very good. On a test that runs for almost 3 hours, we saved more than 20 minutes. The overall runtime improvement is 23% in a case where we would not expect this patch to do particularly well; after all, without limiting the number of runs, we are able to complete the sort with a single merge pass, whereas when we reduce the number of runs, we now require a polyphase merge. Nevertheless, we come out way ahead, because the final merge pass gets way faster, presumably because there are fewer tapes involved. The first test does a 616-
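A back-of-envelope model of why the cap risks extra merge passes, and why the risk was small here. Treating the merge as a simple balanced k-way merge (tuplesort.c actually uses polyphase merge, so this is only an approximation):

```python
import math

def merge_passes(num_runs: int, ntapes: int) -> int:
    """Rough pass count for a balanced merge with ntapes - 1 input
    tapes (one tape reserved for output). A model only, not the
    polyphase schedule tuplesort.c really follows."""
    if num_runs <= 1:
        return 0
    return max(1, math.ceil(math.log(num_runs, ntapes - 1)))
```

With the numbers above: 616 runs over 937 tapes merge in a single pass, while 577 runs over 502 tapes need two passes (the 77-way intermediate merge plus the 501-way final merge), matching the trace_sort output.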
Re: [HACKERS] Parallel tuplesort (for parallel B-Tree index creation)
On Mon, Oct 24, 2016 at 6:17 PM, Peter Geoghegan wrote: >> * Cost model. Should probably attempt to guess final index size, and >> derive calculation of number of workers from that. Also, I'm concerned >> that I haven't given enough thought to the low end, where with default >> settings most CREATE INDEX statements will use at least one parallel >> worker. > While I haven't made progress on any of these open items, I should > still get a version out that applies cleanly on top of git tip -- > commit b75f467b6eec0678452fd8d7f8d306e6df3a1076 caused the patch to > bitrot. I attach V4, which is a fairly mechanical rebase of V3, with > no notable behavioral changes or bug fixes. I attach V5. Changes: * A big cost model overhaul. Workers are logarithmically scaled based on projected final *index* size, not current heap size, as was the case in V4. A new nbtpage.c routine is added to estimate a not-yet-built B-Tree index's size, now called by the optimizer. This involves getting average item width for indexed attributes from pg_attribute for the heap relation. There are some subtleties here with partial indexes, null_frac, etc. I also refined the cap applied on the number of workers that limits too many workers being launched when there isn't so much maintenance_work_mem. The cost model is much improved now -- it is now more than just a placeholder, at least. It doesn't do things like launch a totally inappropriate number of workers to build a very small partial index. Granted, those workers would still have something to do -- scan the heap -- but not enough to justify launching so many (that is, launching as many as would be launched for an equivalent non-partial index). That having been said, things are still quite fudged here, and I struggle to find any guiding principle around doing better on average. I think that that's because of the inherent difficulty of modeling what's going on, but I'd be happy to be proven wrong on that. 
In any case, I think it's going to be fairly common for DBAs to want to use the storage parameter to force the use of a particular number of parallel workers. (See also: my remarks below on how the new bt_estimate_nblocks() SQL-callable function can give insight into the new cost model's decisions.)

* Overhauled leader_mergeruns() further, to make it closer to mergeruns(). We now always rewind input tapes. This simplification involved refining some of the assertions within logtape.c, which is also slightly simplified.

* 2 new testing tools are added to the final commit in the patch series (not actually proposed for commit). I've added 2 new SQL-callable functions to contrib/pageinspect. The 2 new testing functions are:

bt_estimate_nblocks
-------------------

bt_estimate_nblocks() provides an easy way to see the optimizer's projection of how large the final index will be. It returns an estimate in blocks. Example:

mgd=# analyze;
ANALYZE
mgd=# select oid::regclass as rel, bt_estimated_nblocks(oid), relpages,
      to_char(bt_estimated_nblocks(oid)::numeric / relpages, 'FM990.990') as estimate_actual
      from pg_class
      where relkind = 'i'
      order by relpages desc limit 20;

 rel                                                │ bt_estimated_nblocks │ relpages │ estimate_actual
────────────────────────────────────────────────────┼──────────────────────┼──────────┼─────────────────
 mgd.acc_accession_idx_accid                        │              107,091 │  106,274 │ 1.008
 mgd.acc_accession_0                                │              169,024 │  106,274 │ 1.590
 mgd.acc_accession_1                                │              169,024 │   80,382 │ 2.103
 mgd.acc_accession_idx_prefixpart                   │               76,661 │   80,382 │ 0.954
 mgd.acc_accession_idx_mgitype_key                  │               76,661 │   76,928 │ 0.997
 mgd.acc_accession_idx_clustered                    │               76,661 │   76,928 │ 0.997
 mgd.acc_accession_idx_createdby_key                │               76,661 │   76,928 │ 0.997
 mgd.acc_accession_idx_numericpart                  │               76,661 │   76,928 │ 0.997
 mgd.acc_accession_idx_logicaldb_key                │               76,661 │   76,928 │ 0.997
 mgd.acc_accession_idx_modifiedby_key               │               76,661 │   76,928 │ 0.997
 mgd.acc_accession_pkey                             │               76,661 │   76,928 │ 0.997
 mgd.mgi_relationship_property_idx_propertyname_key │               74,197 │   74,462 │ 0.996
 mgd.mgi_relationship_property_idx_modifiedby_key   │               74,197 │   74,462 │ 0.996
 mgd.mgi_relationship_property_pkey                 │               74,197 │   74,462 │ 0.996
 mgd.mgi_relationship_property_idx_clustered        │               74,197 │   74,462 │ 0.996
 mgd.mgi_relationship_property_idx_createdby_key    │               74,197 │   74,462 │ 0.996
 mgd.seq_sequence_idx_clustered                     │               50,051 │   50,486 │ 0.991
 mgd.seq_sequence_raw_pkey                          │               35,826 │   35,952 │ 0.996
 mgd.seq_sequence_raw_idx_modifiedby_key            │               35,826 │   35,952 │ 0.996
 mgd.seq_source_assoc_idx_clustered                 │               35,822 │   35,952 │ 0.996
(20 rows)

I haven't tried to make the
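The logarithmic worker scaling described for the new cost model can be sketched roughly as follows. This is a hypothetical toy: the threshold, growth base, and cap are invented for illustration, and the patch's actual model also weighs maintenance_work_mem, which this deliberately ignores:

```python
import math

def guess_parallel_workers(index_pages: int,
                           min_pages: int = 1024,
                           max_workers: int = 8) -> int:
    """Hypothetical logarithmic scaling: one extra worker each time
    the projected index size triples past a minimum threshold."""
    if index_pages < min_pages:
        return 0            # too small to be worth parallelism
    workers = int(math.log(index_pages / min_pages, 3)) + 1
    return min(workers, max_workers)
```

The point of scaling on *projected index size* rather than heap size is visible here: a small partial index over a huge heap yields a small index_pages estimate, and so few (or zero) workers.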
Re: [HACKERS] Parallel tuplesort (for parallel B-Tree index creation)
On Wed, Oct 19, 2016 at 11:33 AM, Peter Geoghegan wrote: > I don't think that eager merging will prove all that effective, > however it's implemented. I see a very I/O bound system when parallel > CREATE INDEX merges serially. There is no obvious reason why you'd > have a straggler worker process with CREATE INDEX, really. In an effort to head off any misunderstanding around this patch series, I started a new Wiki page for it: https://wiki.postgresql.org/wiki/Parallel_External_Sort This talks about parallel CREATE INDEX in particular, and uses of parallel external sort more generally, including future uses beyond CREATE INDEX. This approach worked very well for me during the UPSERT project, where a detailed overview really helped. With UPSERT, it was particularly difficult to keep the *current* state of things straight, such as current open items for the patch, areas of disagreement, and areas where there was no longer any disagreement or controversy. I don't think that this patch is even remotely as complicated as UPSERT was, but it's still something that has had several concurrently active mailing list threads (threads that are at least loosely related to the project), so I think that this will be useful. I welcome anyone with an interest in this project to review the Wiki page, add their own concerns to it with -hackers citation, and add their own content around related work. There is a kind of unresolved question around where the Gather Merge work might fit into what I've come up with already. There may be other unresolved questions like that, that I'm not even aware of. I commit to maintaining the new Wiki page as a useful starting reference for understanding the current state of this patch. I hope this makes looking into the patch series less intimidating for potential reviewers. -- Peter Geoghegan -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Parallel tuplesort (for parallel B-Tree index creation)
On Mon, Aug 1, 2016 at 3:18 PM, Peter Geoghegan wrote:
> Setup:
>
> CREATE TABLE parallel_sort_test AS
> SELECT hashint8(i) randint,
> md5(i::text) collate "C" padding1,
> md5(i::text || '2') collate "C" padding2
> FROM generate_series(0, 1e9::bigint) i;
>
> CHECKPOINT;
>
> This leaves us with a parallel_sort_test table that is 94 GB in size.
>
> SET maintenance_work_mem = '8GB';
>
> -- Serial case (external sort, should closely match master branch):
> CREATE INDEX serial_idx ON parallel_sort_test (randint) WITH
> (parallel_workers = 0);
>
> Total time: 00:15:42.15
>
> -- Patch with 8 tuplesort "sort-and-scan" workers (leader process
> participates as a worker here):
> CREATE INDEX patch_8_idx ON parallel_sort_test (randint) WITH
> (parallel_workers = 7);
>
> Total time: 00:06:03.86
>
> As you can see, the parallel case is 2.58x faster

I decided to revisit this exact benchmark, using the same AWS instance type (the one with 16 HDDs, again configured in software RAID0) to see how things had changed for both parallel and serial cases. I am now testing V4. A lot changed in the last 3 months, with most of the changes that help here now already committed to the master branch.

Relevant changes
================

* Heikki's major overhaul of preload memory makes CREATE INDEX merging have more sequential access patterns. It also effectively allows us to use more memory. It's possible that the biggest benefit it brings to parallel CREATE INDEX is that it eliminates almost any random I/O penalty from logtape.c fragmentation that an extra merge pass has; parallel workers now usually do their own merge to produce one big run for the leader to merge. It also improves CPU cache efficiency quite directly, I think. This is the patch that helps most. Many thanks to Heikki for driving this forward.

* My patch to simplify and optimize how the K-way merge heap is maintained (as tuples fill leaf pages of the final index structure) makes the merge phase significantly less CPU bound overall.
(These first two items particularly help parallel CREATE INDEX, which spends proportionally much more wall clock time merging than would be expected for similar serial cases. Serial cases do of course also benefit.)

* V2 of the patch (and all subsequent versions) apportioned slices of maintenance_work_mem to workers. maintenance_work_mem became a per-utility-operation budget, regardless of number of workers launched. This means that workers have less memory than the original V1 benchmark (they simply don't make use of it now), but this seems unlikely to hurt. Possibly, it even helps.

* Andres' work on md.c scalability may have helped (seems unlikely with these CREATE INDEX cases that produce indexes not in the hundreds of gigabytes, though). It would help with *extremely* large index creation, which we won't really look at here.

Things now look better than ever for the parallel CREATE INDEX patch. While it's typical for about 75% of wall clock time to be spent on sorting runs with serial CREATE INDEX, with the remaining 25% going on merging/writing index, with parallel CREATE INDEX I now generally see about a 50/50 split between parallel sorting of runs (including any worker merging to produce final runs) and serial merging for final on-the-fly merge where we actually write new index out as input is merged. This is a *significant* improvement over what we saw here back in August, where it was not uncommon for parallel CREATE INDEX to spend *twice* as much time in the serial final on-the-fly merge step. All improvements to the code that we've seen since August have targeted this final on-the-fly merge bottleneck. (The final on-the-fly merge is now *consistently* able to write out the index at a rate of 150MB/sec+ in my tests, which is pretty good.)

New results
===========

Same setup as one quoted above -- once again, we "SET maintenance_work_mem = '8GB'".
-- Patch with 8 tuplesort "sort-and-scan" workers:
CREATE INDEX patch_8_idx ON parallel_sort_test (randint) WITH (parallel_workers = 7);

Total time: 00:04:24.93

-- Serial case:
CREATE INDEX serial_idx ON parallel_sort_test (randint) WITH (parallel_workers = 0);

Total time: 00:14:25.19

3.27x faster. Not bad. As you see in the quoted text, that was 2.58x back in August, even though the implementation now uses a lot less memory in parallel workers. And, that's without even considering the general question of how much faster index creation can be compared to Postgres 9.6 -- it's roughly 3.5x faster at times.

New case
========

Separately, using my gensort tool [1], I came up with a new test case. The tool generated a 2.5 billion row table, sized at 159GB. This is how long it takes to produce a 73GB index on the "sortkey" column of the resulting table:

-- gensort "C" locale text parallel case:
CREATE INDEX test8 on sort_test(sortkey) WITH (parallel_workers = 7);

Total time: 00:16:19.63

-- gensort "C" locale text serial case:
CREATE INDEX test0 on sort_test(sortk
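The maintenance_work_mem change described above (a per-operation budget apportioned among participants, rather than a per-worker allowance) can be sketched as follows. This is a simplification; the patch's actual bookkeeping is more involved:

```python
def per_worker_sort_mem_kb(maintenance_work_mem_kb: int,
                           nparticipants: int) -> int:
    """maintenance_work_mem as a per-operation budget: each participant
    (leader included) sorts within an even slice of it."""
    return maintenance_work_mem_kb // nparticipants
```

For example, the 8GB budget used in these benchmarks, split across the leader plus 7 workers, leaves each participant a 1GB slice, whereas in the original V1 benchmark each worker could use the full budget.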
Re: [HACKERS] Parallel tuplesort (for parallel B-Tree index creation)
On Fri, Oct 7, 2016 at 5:47 PM, Peter Geoghegan wrote: > Work is still needed on: > > * Cost model. Should probably attempt to guess final index size, and > derive calculation of number of workers from that. Also, I'm concerned > that I haven't given enough thought to the low end, where with default > settings most CREATE INDEX statements will use at least one parallel > worker. > > * The whole way that I teach nbtsort.c to disallow catalog tables for > parallel CREATE INDEX due to concerns about parallel safety is in need > of expert review, preferably from Robert. It's complicated in a way > that relies on things happening or not happening from a distance. > > * Heikki seems to want to change more about logtape.c, and its use of > indirection blocks. That may evolve, but for now I can only target the > master branch. > > * More extensive performance testing. I think that this V3 is probably > the fastest version yet, what with Heikki's improvements, but I > haven't really verified that. While I haven't made progress on any of these open items, I should still get a version out that applies cleanly on top of git tip -- commit b75f467b6eec0678452fd8d7f8d306e6df3a1076 caused the patch to bitrot. I attach V4, which is a fairly mechanical rebase of V3, with no notable behavioral changes or bug fixes. -- Peter Geoghegan

Attachments:
0003-Add-force_btree_randomaccess-GUC-for-testing.patch.gz
0002-Add-parallel-B-tree-index-build-sorting.patch.gz
0001-Cap-the-number-of-tapes-used-by-external-sorts.patch.gz
Re: [HACKERS] Parallel tuplesort (for parallel B-Tree index creation)
On Fri, Oct 21, 2016 at 4:25 PM, Amit Kapila wrote: > On Tue, Oct 18, 2016 at 3:48 AM, Peter Geoghegan wrote: >> On Mon, Oct 17, 2016 at 5:36 AM, Amit Kapila wrote: >> >> I read the following paragraph from the Volcano paper just now: >> >> """ >> During implementation and benchmarking of parallel sorting, we added >> two more features to exchange. First, we wanted to implement a merge >> network in which some processors produce sorted streams merge >> concurrently by other processors. Volcano’s sort iterator can be used >> to generate a sorted stream. A merge iterator was easily derived from >> the sort module. It uses a single level merge, instead of the cascaded >> merge of runs used in sort. The input of a merge iterator is an >> exchange. Differently from other operators, the merge iterator >> requires to distinguish the input records by their producer. As an >> example, for a join operation it does not matter where the input >> records were created, and all inputs can be accumulated in a single >> input stream. For a merge operation, it is crucial to distinguish the >> input records by their producer in order to merge multiple sorted >> streams correctly. >> """ >> >> I don't really understand this paragraph, but thought I'd ask: why the >> need to "distinguish the input records by their producer in order to >> merge multiple sorted streams correctly"? Isn't that talking about >> partitioning, where each workers *ownership* of a range matters? >> > > I think so, but it seems from above text that is mainly required for > merge iterator which probably will be used in merge join. > >> My >> patch doesn't care which values belong to which workers. And, it >> focuses quite a lot on dealing well with the memory bandwidth bound, >> I/O bound part of the sort where we write out the index itself, just >> by piggy-backing on tuplesort.c. 
I don't think that that's useful for >> a general-purpose executor node -- tuple-at-a-time processing when >> fetching from workers would kill performance. >> > > Right, but what is written in text quoted by you seems to be do-able > with tuple-at-a-time processing. > To be clear, by saying above, I don't mean that we should try that approach instead of what you are proposing, but it is worth some discussion to see if that has any significant merits. -- With Regards, Amit Kapila. EnterpriseDB: http://www.enterprisedb.com
Re: [HACKERS] Parallel tuplesort (for parallel B-Tree index creation)
On Tue, Oct 18, 2016 at 3:48 AM, Peter Geoghegan wrote: > On Mon, Oct 17, 2016 at 5:36 AM, Amit Kapila wrote: > > I read the following paragraph from the Volcano paper just now: > > """ > During implementation and benchmarking of parallel sorting, we added > two more features to exchange. First, we wanted to implement a merge > network in which some processors produce sorted streams merge > concurrently by other processors. Volcano’s sort iterator can be used > to generate a sorted stream. A merge iterator was easily derived from > the sort module. It uses a single level merge, instead of the cascaded > merge of runs used in sort. The input of a merge iterator is an > exchange. Differently from other operators, the merge iterator > requires to distinguish the input records by their producer. As an > example, for a join operation it does not matter where the input > records were created, and all inputs can be accumulated in a single > input stream. For a merge operation, it is crucial to distinguish the > input records by their producer in order to merge multiple sorted > streams correctly. > """ > > I don't really understand this paragraph, but thought I'd ask: why the > need to "distinguish the input records by their producer in order to > merge multiple sorted streams correctly"? Isn't that talking about > partitioning, where each workers *ownership* of a range matters? > I think so, but it seems from above text that is mainly required for merge iterator which probably will be used in merge join. > My > patch doesn't care which values belong to which workers. And, it > focuses quite a lot on dealing well with the memory bandwidth bound, > I/O bound part of the sort where we write out the index itself, just > by piggy-backing on tuplesort.c. I don't think that that's useful for > a general-purpose executor node -- tuple-at-a-time processing when > fetching from workers would kill performance. 
> Right, but what is written in text quoted by you seems to be do-able with tuple-at-a-time processing. -- With Regards, Amit Kapila. EnterpriseDB: http://www.enterprisedb.com
Re: [HACKERS] Parallel tuplesort (for parallel B-Tree index creation)
On Thu, Oct 20, 2016 at 12:03 AM, Peter Geoghegan wrote: > On Wed, Oct 19, 2016 at 7:39 AM, Robert Haas wrote: >> Gather Merge can't emit a tuple unless it has buffered at least one >> tuple from every producer; otherwise, the next tuple it receives from >> one of those producers might proceed whichever tuple it chooses to >> emit. Right. Now, after again looking at Gather Merge patch, I think I can better understand how it performs merging. >> However, it doesn't need to wait until all of the workers are >> completely done. The leader only needs to be at least slightly ahead >> of the slowest worker. I'm not sure how that compares to Peter's >> approach. > > I don't think that eager merging will prove all that effective, > however it's implemented. I see a very I/O bound system when parallel > CREATE INDEX merges serially. There is no obvious reason why you'd > have a straggler worker process with CREATE INDEX, really. > >> What I'm worried about is that we're implementing two separate systems >> to do the same thing, and that the parallel sort approach is actually >> a lot less general. I think it's possible to imagine a Parallel Sort >> implementation which does things Gather Merge can't. If all of the >> workers collaborate to sort all of the data rather than each worker >> sorting its own data, then you've got something which Gather Merge >> can't match. But this is not that. > > It's not that yet, certainly. I think I've sketched a path forward for > making partitioning a part of logtape.c that is promising. The sharing > of ranges within tapes and so on will probably have a significant > amount in common with what I've come up with. > > I don't think that any executor infrastructure is a particularly good > model when *batch output* is needed -- the tuple queue mechanism will > be a significant bottleneck, particularly because it does not > integrate read-ahead, etc. 
> Tuple queue mechanism might not be super-efficient for *batch output* (cases where many tuples need to be read and written), but I see no reason why it will be slower than the disk I/O which I think you are using in the patch. IIUC, in the patch each worker including the leader does the tape sort for its share of tuples and then finally the leader merges and populates the index. I am not sure if the mechanism used in the patch can be useful as compared to using a tuple queue, if the workers can finish their part of sorting in-memory. > > The bottom line is that it's inherently difficult for me to refute the > idea that Gather Merge could do just as well as what I have here, > because proving that involves adding a significant amount of new > infrastructure (e.g., to teach the executor about IndexTuples). > I think there could be a simpler way: we can force the gather merge node when all the tuples need to be sorted, and measure the time till it merges all tuples. Similarly, with your patch, we can wait till the final merge is completed. However, after doing an initial study of both patches, I feel one can construct cases where Gather Merge wins and also cases where your patch wins. In particular, Gather Merge can win where workers can perform their sorts mostly in-memory. I am not sure if it's easy to get the best of both worlds. Your patch needs a rebase, and I noticed one warning:

sort\logtape.c(1422): warning C4700: uninitialized local variable 'lt' used

-- With Regards, Amit Kapila. EnterpriseDB: http://www.enterprisedb.com
Re: [HACKERS] Parallel tuplesort (for parallel B-Tree index creation)
On Wed, Oct 19, 2016 at 7:39 AM, Robert Haas wrote: > Gather Merge can't emit a tuple unless it has buffered at least one > tuple from every producer; otherwise, the next tuple it receives from > one of those producers might proceed whichever tuple it chooses to > emit. However, it doesn't need to wait until all of the workers are > completely done. The leader only needs to be at least slightly ahead > of the slowest worker. I'm not sure how that compares to Peter's > approach. I don't think that eager merging will prove all that effective, however it's implemented. I see a very I/O bound system when parallel CREATE INDEX merges serially. There is no obvious reason why you'd have a straggler worker process with CREATE INDEX, really. > What I'm worried about is that we're implementing two separate systems > to do the same thing, and that the parallel sort approach is actually > a lot less general. I think it's possible to imagine a Parallel Sort > implementation which does things Gather Merge can't. If all of the > workers collaborate to sort all of the data rather than each worker > sorting its own data, then you've got something which Gather Merge > can't match. But this is not that. It's not that yet, certainly. I think I've sketched a path forward for making partitioning a part of logtape.c that is promising. The sharing of ranges within tapes and so on will probably have a significant amount in common with what I've come up with. I don't think that any executor infrastructure is a particularly good model when *batch output* is needed -- the tuple queue mechanism will be a significant bottleneck, particularly because it does not integrate read-ahead, etc. The best case that I saw advertised for Gather Merge was TPC-H query 9 [1]. 
That doesn't look like a good proxy for how Gather Merge adapted to parallel CREATE INDEX would do, since it benefits from the GroupAggregate merge having many equal values, possibly with a clustering in the original tables that can naturally be exploited (no TID tiebreaker needed, since IndexTuples are not being merged). Also, it looks like Gather Merge may do that well by enabling things, rather than parallelizing the sort effectively per se. Besides, the query 9 case is significantly less scalable than good cases for this parallel CREATE INDEX patch have already been shown to be. I think I've been pretty modest about what this parallel CREATE INDEX patch gets us from the beginning. It is a generalization of tuplesort.c to work in parallel; we need a lot more for that to make things like GroupAggregate as scalable as possible, and I don't pretend that this helps much with that. There are actually more changes to nbtsort.c to coordinate all of this than there are to tuplesort.c in the latest version, so I think that this simpler approach for parallel CREATE INDEX and CLUSTER is worthwhile. The bottom line is that it's inherently difficult for me to refute the idea that Gather Merge could do just as well as what I have here, because proving that involves adding a significant amount of new infrastructure (e.g., to teach the executor about IndexTuples). I think that the argument for this basic approach is sound (it appears to offer comparable scalability to the parallel CREATE INDEX implementations of other systems), but it's simply impractical for me to offer much reassurance beyond that. [1] https://github.com/tvondra/pg_tpch/blob/master/dss/templates/9.sql -- Peter Geoghegan
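Robert's buffering invariant upthread -- that the merge can only safely emit a tuple once it has one buffered tuple from every non-exhausted producer -- can be illustrated with a toy merge over plain Python iterators standing in for the per-worker tuple queues. This illustrates the invariant only; it is not Gather Merge's implementation:

```python
import heapq

def gather_merge(producers):
    """Merge sorted per-producer streams, buffering exactly one tuple
    per producer. A popped tuple is safe to emit because nothing still
    buffered (or yet to arrive from that producer) can precede it."""
    iters = [iter(p) for p in producers]
    heap = []
    for i, it in enumerate(iters):
        first = next(it, None)          # buffer one tuple per producer
        if first is not None:
            heap.append((first, i))
    heapq.heapify(heap)
    merged = []
    while heap:
        tup, i = heapq.heappop(heap)    # smallest buffered tuple
        merged.append(tup)
        nxt = next(iters[i], None)      # refill from the drained producer
        if nxt is not None:
            heapq.heappush(heap, (nxt, i))
    return merged
```

Note that a slow producer only stalls the merge when its buffered tuple is the next candidate, which is why the leader need only stay slightly ahead of the slowest worker rather than wait for all workers to finish.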
Re: [HACKERS] Parallel tuplesort (for parallel B-Tree index creation)
On Mon, Oct 17, 2016 at 8:36 AM, Amit Kapila wrote: >> This project of mine is about parallelizing tuplesort.c, which isn't >> really what you want for parallel query -- you shouldn't try to scope >> the problem as "make the sort more scalable using parallelism" there. >> Rather, you want to scope it at "make the execution of the entire >> query more scalable using parallelism", which is really quite a >> different thing, which necessarily involves the executor having direct >> knowledge of partition boundaries. > > Okay, but what is the proof or why do you think second is going to > better than first? One thing which strikes as a major difference > between your approach and Gather Merge is that in your approach leader > has to wait till all the workers have done with their work on sorting > whereas with Gather Merge as soon as first one is done, leader starts > with merging. I could be wrong here, but if I understood it > correctly, then there is a argument that Gather Merge kind of approach > can win in cases where some of the workers can produce sorted outputs > ahead of others and I am not sure if we can dismiss such cases. Gather Merge can't emit a tuple unless it has buffered at least one tuple from every producer; otherwise, the next tuple it receives from one of those producers might precede whichever tuple it chooses to emit. However, it doesn't need to wait until all of the workers are completely done. The leader only needs to be at least slightly ahead of the slowest worker. I'm not sure how that compares to Peter's approach. What I'm worried about is that we're implementing two separate systems to do the same thing, and that the parallel sort approach is actually a lot less general. I think it's possible to imagine a Parallel Sort implementation which does things Gather Merge can't. If all of the workers collaborate to sort all of the data rather than each worker sorting its own data, then you've got something which Gather Merge can't match. 
But this is not that. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
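The buffering invariant Robert describes -- the next tuple emitted must be the minimum of the current head of every producer's stream, so nothing can be emitted while any head is unknown -- can be sketched as follows. This is illustrative C only, not PostgreSQL code; all names are invented:

```c
#include <assert.h>
#include <stddef.h>

/*
 * Toy model of the Gather Merge emission rule: the next value to emit is
 * the smallest current head among all producers.  If any producer's head
 * were unknown, that producer might later supply a value that precedes
 * the one chosen for emission, so every head must be buffered first.
 */
typedef struct
{
    const int  *vals;   /* producer's sorted output stream */
    size_t      n;      /* total values in the stream */
    size_t      pos;    /* next unread element */
} Producer;

/*
 * Return the index of the producer whose head is smallest, or -1 when all
 * streams are exhausted.  Note that every non-exhausted head is examined.
 */
static int
pick_next(Producer *p, int nproducers)
{
    int best = -1;

    for (int i = 0; i < nproducers; i++)
    {
        if (p[i].pos < p[i].n &&
            (best < 0 || p[i].vals[p[i].pos] < p[best].vals[p[best].pos]))
            best = i;
    }
    return best;
}

/* Merge all producers into out[]; returns the number of values emitted. */
static size_t
gather_merge(Producer *p, int nproducers, int *out)
{
    size_t  emitted = 0;
    int     which;

    while ((which = pick_next(p, nproducers)) >= 0)
        out[emitted++] = p[which].vals[p[which].pos++];
    return emitted;
}
```

A real implementation keeps the producer heads in a binary heap rather than scanning linearly, but the invariant is the same.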
Re: [HACKERS] Parallel tuplesort (for parallel B-Tree index creation)
On Mon, Oct 17, 2016 at 5:36 AM, Amit Kapila wrote: > Okay, but what is the proof or why do you think second is going to > better than first? I don't have proof. It's my opinion that it probably would be, based on partial information, and my intuition. It's hard to prove something like that, because it's not really clear what that alternative would look like. Also, finding all of that out would take a long time -- it's hard to prototype. Do tuple table slots need to care about IndexTuples now? What does that even look like? What existing executor code needs to be taught about this new requirement? > One thing which strikes as a major difference > between your approach and Gather Merge is that in your approach leader > has to wait till all the workers have done with their work on sorting > whereas with Gather Merge as soon as first one is done, leader starts > with merging. I could be wrong here, but if I understood it > correctly, then there is a argument that Gather Merge kind of approach > can win in cases where some of the workers can produce sorted outputs > ahead of others and I am not sure if we can dismiss such cases. How can it? You need to have at least one tuple from every worker (before the worker has exhausted its supply of output tuples) in order to merge to return the next tuple to the top level consumer (the thing above the Gather Merge). If you're talking about "eager vs. lazy merging", please see my previous remarks on that, on this thread. (In any case, whether we merge more eagerly seems like an orthogonal question to the one you ask.) The first thing to note about my approach is that I openly acknowledge that this parallel CREATE INDEX patch is not much use for parallel query. I have only generalized tuplesort.c to support parallelizing a sort operation. 
I think that parallel query needs partitioning to push down parts of a sort to workers, with little or no need for them to be funneled together at the end, since most tuples are eliminated before being passed to the Gather/Gather Merge node. The partitioning part is really hard. I guess that Gather Merge nodes have value because they allow us to preserve the sorted-ness of a parallel path, which might be most useful because it enables things elsewhere. But, that doesn't really recommend making Gather Merge nodes good at batch processing a large number of tuples, I suspect. (One problem with the tuple queue mechanism is that it can be a big bottleneck -- that's why we want to eliminate most tuples before they're passed up to the leader, in the case of parallel sequential scan in 9.6.) I read the following paragraph from the Volcano paper just now: """ During implementation and benchmarking of parallel sorting, we added two more features to exchange. First, we wanted to implement a merge network in which some processors produce sorted streams merged concurrently by other processors. Volcano’s sort iterator can be used to generate a sorted stream. A merge iterator was easily derived from the sort module. It uses a single level merge, instead of the cascaded merge of runs used in sort. The input of a merge iterator is an exchange. Differently from other operators, the merge iterator requires to distinguish the input records by their producer. As an example, for a join operation it does not matter where the input records were created, and all inputs can be accumulated in a single input stream. For a merge operation, it is crucial to distinguish the input records by their producer in order to merge multiple sorted streams correctly. """ I don't really understand this paragraph, but thought I'd ask: why the need to "distinguish the input records by their producer in order to merge multiple sorted streams correctly"? 
Isn't that talking about partitioning, where each worker's *ownership* of a range matters? My patch doesn't care which values belong to which workers. And, it focuses quite a lot on dealing well with the memory bandwidth bound, I/O bound part of the sort where we write out the index itself, just by piggy-backing on tuplesort.c. I don't think that that's useful for a general-purpose executor node -- tuple-at-a-time processing when fetching from workers would kill performance. > By looking at 'workersFinished' usage, it looks like you have devised > a new way for leader to know when workers have finished which might be > required for this patch. However, have you tried to use or > investigate if existing infrastructure which serves same purpose could > be used for it? Yes, I have. I think that Robert's "condition variables" patch would offer a general solution to what I've devised. What I have there is, as you say, fairly ad-hoc, even though my requirements are actually fairly general. I was actually annoyed that there wasn't an easier way to do that myself. Robert has said that he won't commit his "condition variables" work until it's clear that there will be some use for the facility. Well, I'd use it for this patch, if
Re: [HACKERS] Parallel tuplesort (for parallel B-Tree index creation)
On Thu, Oct 13, 2016 at 12:35 AM, Peter Geoghegan wrote: > On Wed, Oct 12, 2016 at 11:09 AM, Robert Haas wrote: > >> On the flip side, what if anything can queries hope to get out of >> parallel sort that they can't get out of Gather Merge? One >> possibility is that a parallel sort might end up being substantially >> faster than Gather Merge-over-non-parallel sort. In that case, we >> obviously should prefer it. > > I must admit that I don't know enough about it to comment just yet. > Offhand, it occurs to me that the Gather Merge sorted input could come > from a number of different types of paths/nodes, whereas adopting what > I've done here could only work more or less equivalently to "Gather > Merge -> Sort -> Seq Scan" -- a special case, really. > >> For example, it's possible that you might want to have all >> workers participate in sorting some data and then divide the result of >> the sort into equal ranges that are again divided among the workers, >> or that you might want all workers to sort and then each worker to >> read a complete copy of the output data. But these don't seem like >> particularly mainstream needs, nor do they necessarily seem like >> problems that parallel sort itself should be trying to solve. > > This project of mine is about parallelizing tuplesort.c, which isn't > really what you want for parallel query -- you shouldn't try to scope > the problem as "make the sort more scalable using parallelism" there. > Rather, you want to scope it at "make the execution of the entire > query more scalable using parallelism", which is really quite a > different thing, which necessarily involves the executor having direct > knowledge of partition boundaries. > Okay, but what is the proof or why do you think second is going to better than first? 
One thing which strikes me as a major difference between your approach and Gather Merge is that in your approach the leader has to wait till all the workers are done with their work on sorting whereas with Gather Merge as soon as the first one is done, the leader starts with merging. I could be wrong here, but if I understood it correctly, then there is an argument that a Gather Merge kind of approach can win in cases where some of the workers can produce sorted outputs ahead of others and I am not sure if we can dismiss such cases.

+struct Sharedsort
+{
..
+ * Workers increment workersFinished to indicate having finished. If
+ * this is equal to state.launched within the leader, leader is ready
+ * to merge runs.
+ *
+ * leaderDone indicates if leader is completely done (i.e., was
+ * tuplesort_end called against the state through which parallel output
+ * was consumed?)
+ */
+ int currentWorker;
+ int workersFinished;
..
}

By looking at 'workersFinished' usage, it looks like you have devised a new way for the leader to know when workers have finished which might be required for this patch. However, have you tried to use or investigate if existing infrastructure which serves the same purpose could be used for it? -- With Regards, Amit Kapila. EnterpriseDB: http://www.enterprisedb.com
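The coordination the Sharedsort comment describes -- each worker incrementing workersFinished, the leader merging only once the count reaches the number of launched workers -- amounts to a single-use barrier. A minimal sketch using pthreads (purely illustrative; the patch itself works through shared memory, and all names below are invented):

```c
#include <assert.h>
#include <pthread.h>

/*
 * Illustrative only: "leader waits until every launched worker has
 * finished", expressed with a mutex and a condition variable.
 */
typedef struct
{
    pthread_mutex_t lock;
    pthread_cond_t  done;
    int             launched;           /* number of workers launched */
    int             workersFinished;    /* workers done sorting so far */
} SharedSortState;

/* Called by each worker after its sort completes. */
static void
worker_finished(SharedSortState *st)
{
    pthread_mutex_lock(&st->lock);
    st->workersFinished++;
    pthread_cond_signal(&st->done);
    pthread_mutex_unlock(&st->lock);
}

/* Called by the leader; blocks until every launched worker is finished. */
static void
leader_wait_for_workers(SharedSortState *st)
{
    pthread_mutex_lock(&st->lock);
    while (st->workersFinished < st->launched)
        pthread_cond_wait(&st->done, &st->lock);
    pthread_mutex_unlock(&st->lock);
}
```

Robert's proposed "condition variables" facility would presumably give this same wait-loop shape on top of PostgreSQL shared memory, which is why it is mentioned downthread as a general replacement for the ad-hoc mechanism.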
Re: [HACKERS] Parallel tuplesort (for parallel B-Tree index creation)
On Wed, Oct 12, 2016 at 11:09 AM, Robert Haas wrote: > I realize that you are primarily targeting utility commands here, and > that is obviously great, because making index builds faster is very > desirable. However, I'd just like to talk for a minute about how this > relates to parallel query. With Rushabh's Gather Merge patch, you can > now have a plan that looks like Gather Merge -> Sort -> whatever. > That patch also allows other patterns that are useful completely > independently of this patch, like Finalize GroupAggregate -> Gather > Merge -> Partial GroupAggregate -> Sort -> whatever, but the Gather > Merge -> Sort -> whatever path is very related to what this patch > does. For example, instead of committing this patch at all, we could > try to funnel index creation through the executor, building a plan of > that shape, and using the results to populate the index. I'm not > saying that's a good idea, but it could be done. Right, but that would be essentially the same approach as mine, but, I suspect, less efficient and more complicated. More importantly, it wouldn't be partitioning, and partitioning is what we really need within the executor. > On the flip side, what if anything can queries hope to get out of > parallel sort that they can't get out of Gather Merge? One > possibility is that a parallel sort might end up being substantially > faster than Gather Merge-over-non-parallel sort. In that case, we > obviously should prefer it. I must admit that I don't know enough about it to comment just yet. Offhand, it occurs to me that the Gather Merge sorted input could come from a number of different types of paths/nodes, whereas adopting what I've done here could only work more or less equivalently to "Gather Merge -> Sort -> Seq Scan" -- a special case, really. 
> For example, it's possible that you might want to have all > workers participate in sorting some data and then divide the result of > the sort into equal ranges that are again divided among the workers, > or that you might want all workers to sort and then each worker to > read a complete copy of the output data. But these don't seem like > particularly mainstream needs, nor do they necessarily seem like > problems that parallel sort itself should be trying to solve. This project of mine is about parallelizing tuplesort.c, which isn't really what you want for parallel query -- you shouldn't try to scope the problem as "make the sort more scalable using parallelism" there. Rather, you want to scope it at "make the execution of the entire query more scalable using parallelism", which is really quite a different thing, which necessarily involves the executor having direct knowledge of partition boundaries. Maybe the executor enlists tuplesort.c to help with those boundaries to some degree, but that whole thing is basically something which treats tuplesort.c as a low level primitive. > The > Volcano paper[1], one of the oldest and most-cited sources I can find > for research into parallel execution and with a design fairly similar > to our own executor, describes various variants of what they call > Exchange, of which what we now call Gather is one. I greatly respect the work of Goetz Graefe, including his work on the Volcano paper. Graefe has been the single biggest external influence on my work on Postgres. > They describe > another variant called Interchange, which acts like a Gather node > without terminating parallelism: every worker process reads the > complete output of an Interchange, which is the union of all rows > produced by all workers running the Interchange's input plan. That > seems like a better design than coupling such data flows specifically > to parallel sort. 
> > I'd like to think that parallel sort will help lots of queries, as > well as helping utility commands, but I'm not sure it will. Thoughts? You are right that I'm targeting the cases where we can get real benefits without really changing the tuplesort.h contract too much. This is literally the parallel tuplesort.c patch, which probably isn't very useful for parallel query, because the final output is always consumed serially here (this doesn't matter all that much for CREATE INDEX, I believe). This approach of mine seems like the simplest way of getting a large benefit to users involving parallelizing sorting, but I certainly don't imagine it to be the be all and end all. I have at least tried to anticipate how tuplesort.c will eventually serve the needs of partitioning for the benefit of parallel query. My intuition is that you'll have to teach it about partitioning boundaries fairly directly -- it won't do to add something generic to the executor. And, it probably won't be the only thing that needs to be taught about them. -- Peter Geoghegan -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Parallel tuplesort (for parallel B-Tree index creation)
On Fri, Oct 7, 2016 at 5:47 PM, Peter Geoghegan wrote: > Work is still needed on: > > * Cost model. Should probably attempt to guess final index size, and > derive calculation of number of workers from that. Also, I'm concerned > that I haven't given enough thought to the low end, where with default > settings most CREATE INDEX statements will use at least one parallel > worker. > > * The whole way that I teach nbtsort.c to disallow catalog tables for > parallel CREATE INDEX due to concerns about parallel safety is in need > of expert review, preferably from Robert. It's complicated in a way > that relies on things happening or not happening from a distance. > > * Heikki seems to want to change more about logtape.c, and its use of > indirection blocks. That may evolve, but for now I can only target the > master branch. > > * More extensive performance testing. I think that this V3 is probably > the fastest version yet, what with Heikki's improvements, but I > haven't really verified that. I realize that you are primarily targeting utility commands here, and that is obviously great, because making index builds faster is very desirable. However, I'd just like to talk for a minute about how this relates to parallel query. With Rushabh's Gather Merge patch, you can now have a plan that looks like Gather Merge -> Sort -> whatever. That patch also allows other patterns that are useful completely independently of this patch, like Finalize GroupAggregate -> Gather Merge -> Partial GroupAggregate -> Sort -> whatever, but the Gather Merge -> Sort -> whatever path is very related to what this patch does. For example, instead of committing this patch at all, we could try to funnel index creation through the executor, building a plan of that shape, and using the results to populate the index. I'm not saying that's a good idea, but it could be done. On the flip side, what if anything can queries hope to get out of parallel sort that they can't get out of Gather Merge? 
One possibility is that a parallel sort might end up being substantially faster than Gather Merge-over-non-parallel sort. In that case, we obviously should prefer it. Other possibilities seem a little obscure. For example, it's possible that you might want to have all workers participate in sorting some data and then divide the result of the sort into equal ranges that are again divided among the workers, or that you might want all workers to sort and then each worker to read a complete copy of the output data. But these don't seem like particularly mainstream needs, nor do they necessarily seem like problems that parallel sort itself should be trying to solve. The Volcano paper[1], one of the oldest and most-cited sources I can find for research into parallel execution and with a design fairly similar to our own executor, describes various variants of what they call Exchange, of which what we now call Gather is one. They describe another variant called Interchange, which acts like a Gather node without terminating parallelism: every worker process reads the complete output of an Interchange, which is the union of all rows produced by all workers running the Interchange's input plan. That seems like a better design than coupling such data flows specifically to parallel sort. I'd like to think that parallel sort will help lots of queries, as well as helping utility commands, but I'm not sure it will. Thoughts? -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company [1] "Volcano - an Extensible and Parallel Query Evaluation System", https://pdfs.semanticscholar.org/865b/5f228f08ebac0b68d3a4bfd97929ee85e4b6.pdf [2] See "C. Variants of the Exchange Operator" on p. 13 of [1] -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Parallel tuplesort (for parallel B-Tree index creation)
On Sun, Sep 11, 2016 at 11:05 AM, Peter Geoghegan wrote: > So, while there are still a few loose ends with this revision (it > should still certainly be considered WIP), I wanted to get a revision > out quickly because V1 has been left to bitrot for too long now, and > my schedule is very full for the next week, ahead of my leaving to go > on vacation (which is long overdue). Hopefully, I'll be able to get > out a third revision next Saturday, on top of the > by-then-presumably-committed new tape batch memory patch from Heikki, > just before I leave. I'd rather leave with a patch available that can > be cleanly applied, to make review as easy as possible, since it > wouldn't be great to have this V2 with bitrot for 10 days or more. Heikki committed his preload memory patch a little later than originally expected, 4 days ago. I attach V3 of my own parallel CREATE INDEX patch, which should be applied on top of today's git master (there is a bugfix that reviewers won't want to miss -- commit b56fb691). I have my own test suite, and have to some extent used TDD for this patch, so rebasing was not so bad. My tests are rather rough and ready, so I'm not going to post them here. (Changes in the WaitLatch() API also caused bitrot, which is now fixed.) Changes from V2: * Since Heikki eliminated the need for any extra memtuple "slots" (memtuples is now only exactly big enough for the initial merge heap), an awful lot of code could be thrown out that managed sizing memtuples in the context of the parallel leader (based on trends seen in parallel workers). I was able to follow Heikki's example by eliminating code for parallel sorting memtuples sizing. Throwing this code out let me streamline a lot of stuff within tuplesort.c, which is cleaned up quite a bit. * Since this revision was mostly focused on fixing up logtape.c (rebasing on top of Heikki's work), I also took the time to clarify some things about how a block-based offset might need to be applied within the leader. 
Essentially, outlining how and where that happens, and where it doesn't and shouldn't happen. (An offset must sometimes be applied to compensate for the difference in logical BufFile positioning (leader/worker differences) following the leader's unification of worker tapesets into one big tapeset of its own.) * max_parallel_workers_maintenance now supersedes the use of the new parallel_workers index storage parameter. This matches existing heap storage parameter behavior, and allows the implementation to add practically no cycles as compared to master branch when the use of parallelism is disabled by setting max_parallel_workers_maintenance to 0. * New additions to the chapter in the documentation that Robert added a little while back, "Chapter 15. Parallel Query". It's perhaps a bit of a stretch to call this feature part of parallel query, but I think that it works reasonably well. The optimizer does determine the number of workers needed here, so while it doesn't formally produce a query plan, I think the implication that it does is acceptable for user-facing documentation. (Actually, it would be nice if you really could EXPLAIN utility commands -- that would be a handy place to show information about how they were executed.) Maybe this new documentation describes things in what some would consider to be excessive detail for users. The relatively detailed information added on parallel sorting seemed to be in the pragmatic spirit of the new chapter 15, so I thought I'd see what people thought. Work is still needed on: * Cost model. Should probably attempt to guess final index size, and derive calculation of number of workers from that. Also, I'm concerned that I haven't given enough thought to the low end, where with default settings most CREATE INDEX statements will use at least one parallel worker. 
* The whole way that I teach nbtsort.c to disallow catalog tables for parallel CREATE INDEX due to concerns about parallel safety is in need of expert review, preferably from Robert. It's complicated in a way that relies on things happening or not happening from a distance. * Heikki seems to want to change more about logtape.c, and its use of indirection blocks. That may evolve, but for now I can only target the master branch. * More extensive performance testing. I think that this V3 is probably the fastest version yet, what with Heikki's improvements, but I haven't really verified that. -- Peter Geoghegan Attachments: 0001-Cap-the-number-of-tapes-used-by-external-sorts.patch.gz, 0002-Add-parallel-B-tree-index-build-sorting.patch.gz, 0003-Add-force_btree_randomaccess-GUC-for-testing.patch.gz
Re: [HACKERS] Parallel tuplesort (for parallel B-Tree index creation)
On Mon, Sep 26, 2016 at 3:40 PM, Peter Geoghegan wrote: > On Mon, Sep 26, 2016 at 6:58 PM, Robert Haas wrote: >>> That requires some kind of mutual exclusion mechanism, like an LWLock. >> >> No, it doesn't. Shared memory queues are single-reader, single-writer. > > The point is that there is a natural dependency when merging is > performed eagerly within the leader. One thing needs to be in lockstep > with the others. That's all. I don't know what any of that means. You said we need something like an LWLock, but I think we don't. The workers just write the results of their own final merges into shm_mqs. The leader can read from any given shm_mq until no more tuples can be read without blocking, just like nodeGather.c does, or at least it can do that unless its own queue fills up first. No mutual exclusion mechanism is required for any of that, as far as I can see - not an LWLock, and not anything similar. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
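The reason no LWLock is needed here is the single-producer/single-consumer discipline of shm_mq: the writer is the only process that advances the write position and the reader the only one that advances the read position, so neither index is ever updated concurrently. A toy illustration of that discipline follows -- this is not shm_mq's actual layout, and the names and sizes are invented:

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/*
 * Minimal single-producer/single-consumer ring buffer.  No lock is
 * required because `head` is written only by the producer and `tail`
 * only by the consumer; acquire/release ordering makes each side's
 * progress visible to the other.
 */
#define QCAP 8                  /* capacity; power of two for simplicity */

typedef struct
{
    _Atomic unsigned head;      /* next slot to write; producer-owned */
    _Atomic unsigned tail;      /* next slot to read; consumer-owned */
    int              slot[QCAP];
} SpscQueue;

static bool
spsc_put(SpscQueue *q, int v)
{
    unsigned h = atomic_load_explicit(&q->head, memory_order_relaxed);
    unsigned t = atomic_load_explicit(&q->tail, memory_order_acquire);

    if (h - t == QCAP)
        return false;           /* full: caller would block or retry */
    q->slot[h % QCAP] = v;
    atomic_store_explicit(&q->head, h + 1, memory_order_release);
    return true;
}

static bool
spsc_get(SpscQueue *q, int *v)
{
    unsigned t = atomic_load_explicit(&q->tail, memory_order_relaxed);
    unsigned h = atomic_load_explicit(&q->head, memory_order_acquire);

    if (t == h)
        return false;           /* empty: nothing to read yet */
    *v = q->slot[t % QCAP];
    atomic_store_explicit(&q->tail, t + 1, memory_order_release);
    return true;
}
```

With one such queue per worker, the leader can poll each queue until a read would block, exactly as described for nodeGather.c, without any mutual exclusion.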
Re: [HACKERS] Parallel tuplesort (for parallel B-Tree index creation)
On Mon, Sep 26, 2016 at 6:58 PM, Robert Haas wrote: >> That requires some kind of mutual exclusion mechanism, like an LWLock. > > No, it doesn't. Shared memory queues are single-reader, single-writer. The point is that there is a natural dependency when merging is performed eagerly within the leader. One thing needs to be in lockstep with the others. That's all. -- Peter Geoghegan
Re: [HACKERS] Parallel tuplesort (for parallel B-Tree index creation)
On Sat, Sep 24, 2016 at 9:07 AM, Peter Geoghegan wrote: > On Thu, Sep 22, 2016 at 8:57 PM, Robert Haas wrote: >> On Thu, Sep 22, 2016 at 3:51 PM, Heikki Linnakangas wrote: >>> It'd be good if you could overlap the final merges in the workers with the >>> merge in the leader. ISTM it would be quite straightforward to replace the >>> final tape of each worker with a shared memory queue, so that the leader >>> could start merging and returning tuples as soon as it gets the first tuple >>> from each worker. Instead of having to wait for all the workers to complete >>> first. >> >> If you do that, make sure to have the leader read multiple tuples at a >> time from each worker whenever possible. It makes a huge difference >> to performance. See bc7fcab5e36b9597857fa7e3fa6d9ba54aaea167. > > That requires some kind of mutual exclusion mechanism, like an LWLock. No, it doesn't. Shared memory queues are single-reader, single-writer. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Parallel tuplesort (for parallel B-Tree index creation)
On Thu, Sep 22, 2016 at 8:57 PM, Robert Haas wrote: > On Thu, Sep 22, 2016 at 3:51 PM, Heikki Linnakangas wrote: >> It'd be good if you could overlap the final merges in the workers with the >> merge in the leader. ISTM it would be quite straightforward to replace the >> final tape of each worker with a shared memory queue, so that the leader >> could start merging and returning tuples as soon as it gets the first tuple >> from each worker. Instead of having to wait for all the workers to complete >> first. > > If you do that, make sure to have the leader read multiple tuples at a > time from each worker whenever possible. It makes a huge difference > to performance. See bc7fcab5e36b9597857fa7e3fa6d9ba54aaea167. That requires some kind of mutual exclusion mechanism, like an LWLock. It's possible that merging everything lazily is actually the faster approach, given this, and given the likely bottleneck on I/O at this stage. It's also certainly simpler to not overlap things. This is something I've read about before [1], with "eager evaluation" sorting not necessarily coming out ahead IIRC. [1] http://digitalcommons.ohsu.edu/cgi/viewcontent.cgi?article=1193&context=csetech -- Peter Geoghegan
Re: [HACKERS] Parallel tuplesort (for parallel B-Tree index creation)
On Wed, Sep 21, 2016 at 5:52 PM, Heikki Linnakangas wrote: > I find this unification business really complicated. I can certainly understand why you would. As I said, it's the most complicated part of the patch, which overall is one of the most ambitious patches I've ever written. > I think it'd be simpler > to keep the BufFiles and LogicalTapeSets separate, and instead teach > tuplesort.c how to merge tapes that live on different > LogicalTapeSets/BufFiles. Or refactor LogicalTapeSet so that a single > LogicalTapeSet can contain tapes from different underlying BufFiles. > > What I have in mind is something like the attached patch. It refactors > LogicalTapeRead(), LogicalTapeWrite() etc. functions to take a LogicalTape > as argument, instead of LogicalTapeSet and tape number. LogicalTapeSet > doesn't have the concept of a tape number anymore, it can contain any number > of tapes, and you can create more on the fly. With that, it'd be fairly easy > to make tuplesort.c merge LogicalTapes that came from different tape sets, > backed by different BufFiles. I think that'd avoid much of the unification > code. I think that it won't be possible to make a LogicalTapeSet ever use more than one BufFile without regressing the ability to eagerly reuse space, which is almost the entire reason for logtape.c existing. The whole indirect block thing is an idea borrowed from the FS world, of course, and so logtape.c needs one block-device-like BufFile, with blocks that can be reclaimed eagerly, but consumed for recycling in *contiguous* order (which is why they're sorted using qsort() within ltsGetFreeBlock()). You're going to increase the amount of random I/O by using more than one BufFile for an entire tapeset, I think. This patch you posted ("0001-Refactor-LogicalTapeSet-LogicalTape-interface.patch") just keeps one BufFile, and only changes the interface to expose the tapes themselves to tuplesort.c, without actually making tuplesort.c do anything with that capability. 
I see what you're getting at, I think, but I don't see how that accomplishes all that much for parallel CREATE INDEX. I mean, the special case of having multiple tapesets from workers (not one "unified" tapeset created from worker temp files from their tapesets to begin with) now needs special treatment. Haven't you just moved the complexity around (once your patch is made to care about parallelism)? Having multiple entire tapesets explicitly from workers, with their own BufFiles, is not clearly less complicated than managing ranges from BufFile fd.c files with delineated ranges of "logical tapeset space". Seems almost equivalent, except that my way doesn't bother tuplesort.c with any of this. >> +* As a consequence of only being permitted to write to the leader >> +* controlled range, parallel sorts that require a final >> materialized tape >> +* will use approximately twice the disk space for temp files >> compared to >> +* a more or less equivalent serial sort. > I'm slightly worried about that. Maybe it's OK for a first version, but it'd > be annoying in a query where a sort is below a merge join, for example, so > that you can't do the final merge on the fly because mark/restore support is > needed. My intuition is that we'll *never* end up using this for merge joins. I think that I could do better here (why should workers really care at this point?), but just haven't bothered to. This parallel sort implementation is something written with CREATE INDEX and CLUSTER in mind only (maybe one or two other things, too). I believe that for query execution, partitioning is the future [1]. With merge joins, partitioning is desirable because it lets you push down *everything* to workers, not just sorting (e.g., by aligning partitioning boundaries on each side of each merge join sort in the worker, and having the worker also "synchronize" each side of the join, all independently and without a dependency on a final merge). 
That's why I think it's okay that I use twice as much space for randomAccess tuplesort.c callers. No real world caller will ever end up needing to do this. It just seems like a good idea to support randomAccess when using this new infrastructure, on general principle. Forcing myself to support that case during initial development actually resulted in much cleaner, less invasive changes to tuplesort.c in general. [1] https://www.postgresql.org/message-id/flat/CAM3SWZR+ATYAzyMT+hm-Bo=1l1smtjbndtibwbtktyqs0dy...@mail.gmail.com#CAM3SWZR+ATYAzyMT+hm-Bo=1l1smtjbndtibwbtktyqs0dy...@mail.gmail.com -- Peter Geoghegan -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Parallel tuplesort (for parallel B-Tree index creation)
On Thu, Sep 22, 2016 at 3:51 PM, Heikki Linnakangas wrote: > It'd be good if you could overlap the final merges in the workers with the > merge in the leader. ISTM it would be quite straightforward to replace the > final tape of each worker with a shared memory queue, so that the leader > could start merging and returning tuples as soon as it gets the first tuple > from each worker. Instead of having to wait for all the workers to complete > first. If you do that, make sure to have the leader read multiple tuples at a time from each worker whenever possible. It makes a huge difference to performance. See bc7fcab5e36b9597857fa7e3fa6d9ba54aaea167. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Re: [HACKERS] Parallel tuplesort (for parallel B-Tree index creation)
On 08/02/2016 01:18 AM, Peter Geoghegan wrote: No merging in parallel -- Currently, merging worker *output* runs may only occur in the leader process. In other words, we always keep n worker processes busy with scanning-and-sorting (and maybe some merging), but then all processes but the leader process grind to a halt (note that the leader process can participate as a scan-and-sort tuplesort worker, just as it will everywhere else, which is why I specified "parallel_workers = 7" but talked about 8 workers). One leader process is kept busy with merging these n output runs on the fly, so things will bottleneck on that, which you saw in the example above. As already described, workers will sometimes merge in parallel, but only their own runs -- never another worker's runs. I did attempt to address the leader merge bottleneck by implementing cross-worker run merging in workers. I got as far as implementing a very rough version of this, but initial results were disappointing, and so that was not pursued further than the experimentation stage. Parallel merging is a possible future improvement that could be added to what I've come up with, but I don't think that it will move the needle in a really noticeable way. It'd be good if you could overlap the final merges in the workers with the merge in the leader. ISTM it would be quite straightforward to replace the final tape of each worker with a shared memory queue, so that the leader could start merging and returning tuples as soon as it gets the first tuple from each worker. Instead of having to wait for all the workers to complete first. - Heikki
Re: [HACKERS] Parallel tuplesort (for parallel B-Tree index creation)
On 08/02/2016 01:18 AM, Peter Geoghegan wrote: Tape unification Sort operations have a unique identifier, generated before any workers are launched, using a scheme based on the leader's PID, and a unique temp file number. This makes all on-disk state (temp files managed by logtape.c) discoverable by the leader process. State in shared memory is sized in proportion to the number of workers, so the only thing about the data being sorted that gets passed around in shared memory is a little logtape.c metadata for tapes, describing for example how large each constituent BufFile is (a BufFile associated with one particular worker's tapeset). (See below also for notes on buffile.c's role in all of this, fd.c and resource management, etc.) > ... buffile.c, and "unification" There has been significant new infrastructure added to make logtape.c aware of workers. buffile.c has in turn been taught about unification as a first class part of the abstraction, with low-level management of certain details occurring within fd.c. So, "tape unification" is added, whereby a process opens other backends' logical tapes to generate a unified logical tapeset for the leader to merge. This is probably the single biggest source of complexity for the patch, since I must consider: * Creating a general, reusable abstraction for other possible BufFile users (logtape.c only has to serve tuplesort.c, though). * Logical tape free space management. * Resource management, file lifetime, etc. fd.c resource management can now close a file at xact end for temp files, while not deleting it in the leader backend (only the "owning" worker backend deletes the temp file it owns). * Crash safety (e.g., when to truncate existing temp files, and when not to). I find this unification business really complicated. I think it'd be simpler to keep the BufFiles and LogicalTapeSets separate, and instead teach tuplesort.c how to merge tapes that live on different LogicalTapeSets/BufFiles.
Or refactor LogicalTapeSet so that a single LogicalTapeSet can contain tapes from different underlying BufFiles. What I have in mind is something like the attached patch. It refactors LogicalTapeRead(), LogicalTapeWrite() etc. functions to take a LogicalTape as argument, instead of LogicalTapeSet and tape number. LogicalTapeSet doesn't have the concept of a tape number anymore, it can contain any number of tapes, and you can create more on the fly. With that, it'd be fairly easy to make tuplesort.c merge LogicalTapes that came from different tape sets, backed by different BufFiles. I think that'd avoid much of the unification code. That leaves one problem, though: reusing space in the final merge phase. If the tapes being merged belong to different LogicalTapeSets, and create one new tape to hold the result, the new tape cannot easily reuse the space of the input tapes because they are on different tape sets. But looking at your patch, ISTM you actually dodged that problem as well: +* As a consequence of only being permitted to write to the leader +* controlled range, parallel sorts that require a final materialized tape +* will use approximately twice the disk space for temp files compared to +* a more or less equivalent serial sort. This is deemed acceptable, +* since it is far rarer in practice for parallel sort operations to +* require a final materialized output tape. Note that this does not +* apply to any merge process required by workers, which may reuse space +* eagerly, just like conventional serial external sorts, and so +* typically, parallel sorts consume approximately the same amount of disk +* blocks as a more or less equivalent serial sort, even when workers must +* perform some merging to produce input to the leader. I'm slightly worried about that. 
Maybe it's OK for a first version, but it'd be annoying in a query where a sort is below a merge join, for example, so that you can't do the final merge on the fly because mark/restore support is needed. One way to fix that would be to have all the parallel workers share the work files to begin with, and keep the "nFileBlocks" value in shared memory so that the workers won't overlap each other. Then all the blocks from different workers would be mixed together, though, which would hurt the sequential pattern of the tapes, so each worker would need to allocate larger chunks to avoid that. - Heikki >From a1aa45c22cd13a2059880154e30f48d884a849ef Mon Sep 17 00:00:00 2001 From: Heikki Linnakangas Date: Wed, 21 Sep 2016 19:31:33 +0300 Subject: [PATCH 1/1] Refactor LogicalTapeSet/LogicalTape interface. A LogicalTape is now visible to callers, not just an internal object within logtape.c. All the tape functions, like LogicalTapeRead and LogicalTapeWrite, take a LogicalTape as argument, instead of LogicalTapeSet+tape number. You can cre
Re: [HACKERS] Parallel tuplesort (for parallel B-Tree index creation)
On Sun, Sep 11, 2016 at 2:05 PM, Peter Geoghegan wrote: > On Sun, Sep 11, 2016 at 6:28 AM, Heikki Linnakangas wrote: >> Pushed this "displace root" patch, with some changes: > > Attached is a rebased version of the entire patch series, which should > be applied on top of what you pushed to the master branch today. 0003 looks like a sensible cleanup of our #include structure regardless of anything this patch series is trying to accomplish, so I've committed it. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Re: [HACKERS] Parallel tuplesort (for parallel B-Tree index creation)
On Sun, Sep 11, 2016 at 6:28 AM, Heikki Linnakangas wrote: > Pushed this "displace root" patch, with some changes: Attached is a rebased version of the entire patch series, which should be applied on top of what you pushed to the master branch today. This features a new scheme for managing workMem -- maintenance_work_mem is now treated as a high watermark/budget for the entire CREATE INDEX operation, regardless of the number of workers. This seems to work much better, so Robert was right to suggest it. There were also improvements to the cost model, to weigh available maintenance_work_mem under this new system. And, the cost model was moved inside planner.c (next to plan_cluster_use_sort()), which is really where it belongs. The cost model is still WIP, though, and I didn't address some concerns of my own about how tuplesort.c coordinates workers. I think that Robert's "condition variables" will end up superseding that stuff anyway. And, I think that this v2 will bitrot fairly soon, when Heikki commits what is in effect his version of my 0002-* patch (that's unchanged, if only because it refactors some things that the parallel CREATE INDEX patch is reliant on). So, while there are still a few loose ends with this revision (it should still certainly be considered WIP), I wanted to get a revision out quickly because V1 has been left to bitrot for too long now, and my schedule is very full for the next week, ahead of my leaving to go on vacation (which is long overdue). Hopefully, I'll be able to get out a third revision next Saturday, on top of the by-then-presumably-committed new tape batch memory patch from Heikki, just before I leave. I'd rather leave with a patch available that can be cleanly applied, to make review as easy as possible, since it wouldn't be great to have this V2 with bitrot for 10 days or more.
-- Peter Geoghegan 0003-Rearrange-header-file-include-directives.patch.gz Description: GNU Zip compressed data 0005-Add-force_btree_randomaccess-GUC-for-testing.patch.gz Description: GNU Zip compressed data 0001-Cap-the-number-of-tapes-used-by-external-sorts.patch.gz Description: GNU Zip compressed data 0002-Use-tuplesort-batch-memory-for-randomAccess-sorts.patch.gz Description: GNU Zip compressed data 0004-Add-parallel-B-tree-index-build-sorting.patch.gz Description: GNU Zip compressed data
Re: [HACKERS] Parallel tuplesort (for parallel B-Tree index creation)
On Sun, Sep 11, 2016 at 6:28 AM, Heikki Linnakangas wrote: > * I renamed "tuplesort_heap_siftup()" to "tuplesort_delete_top()". I realize > that this is controversial, per the discussion on the "Is > tuplesort_heap_siftup() a misnomer?" thread. However, now that we have a new > function, "tuplesort_heap_replace_top()", which is exactly the same > algorithm as the "delete_top()" algorithm, calling one of them "siftup" > became just too confusing. I feel pretty strongly that this was the correct decision. I would have gone further, and removed any mention of "Sift up", but you can't win them all. > * Instead of "root_displace", I used the name "replace_top", and > "delete_top" for the old siftup function. Because we use "top" to refer to > memtuples[0] more commonly than "root", in the existing comments. Fine by me. > * I shared the code between the delete-top and replace-top. Delete-top now > calls the replace-top function, with the last element of the heap. Both > functions have the same signature, i.e. they both take the checkIndex > argument. Peter's patch left that out for the "replace" function, on > performance grounds, but if that's worthwhile, that seems like a separate > optimization. Might be worth benchmarking that separately, but I didn't want > to conflate that with this patch. Okay. > * I replaced a few more siftup+insert calls with the new combined > replace-top operation. Because why not. I suppose that the consistency has value, from a code clarity standpoint. > Thanks for the patch, Peter, and thanks for the review, Claudio! Thanks Heikki! -- Peter Geoghegan
Re: [HACKERS] Parallel tuplesort (for parallel B-Tree index creation)
On 09/10/2016 03:22 AM, Claudio Freire wrote: Overall, however, I believe the patch is in good shape. Only minor form issues need to be changed; the functionality seems both desirable and ready. Pushed this "displace root" patch, with some changes: * I renamed "tuplesort_heap_siftup()" to "tuplesort_delete_top()". I realize that this is controversial, per the discussion on the "Is tuplesort_heap_siftup() a misnomer?" thread. However, now that we have a new function, "tuplesort_heap_replace_top()", which is exactly the same algorithm as the "delete_top()" algorithm, calling one of them "siftup" became just too confusing. If anything, the new "replace_top" corresponds more closely to Knuth's siftup algorithm; delete-top is a special case of it. I added a comment on that to replace_top. I hope everyone can live with this. * Instead of "root_displace", I used the name "replace_top", and "delete_top" for the old siftup function. Because we use "top" to refer to memtuples[0] more commonly than "root", in the existing comments. * I shared the code between the delete-top and replace-top. Delete-top now calls the replace-top function, with the last element of the heap. Both functions have the same signature, i.e. they both take the checkIndex argument. Peter's patch left that out for the "replace" function, on performance grounds, but if that's worthwhile, that seems like a separate optimization. Might be worth benchmarking that separately, but I didn't want to conflate that with this patch. * I replaced a few more siftup+insert calls with the new combined replace-top operation. Because why not. Thanks for the patch, Peter, and thanks for the review, Claudio! - Heikki
Re: [HACKERS] Parallel tuplesort (for parallel B-Tree index creation)
On Fri, Sep 9, 2016 at 5:22 PM, Claudio Freire wrote: > Since it is true that doing so would make it impossible to keep the > asserts about tupindex in tuplesort_heap_root_displace, I guess it > depends on how useful those asserts are (ie: how likely it is that > those conditions could be violated, and how damaging it could be if > they were). If it is decided the refactor is desirable, I'd suggest > making the common siftup procedure static inline, to allow > tuplesort_heap_root_displace to inline and specialize it, since it > will be called with checkIndex=False and that simplifies the resulting > code considerably. Right. I want to keep it as a separate function for all these reasons. I also think that I'll end up further optimizing what I've called tuplesort_heap_root_displace in the future, to adapt to clustered input. I'm thinking of something like Timsort's "galloping mode". What I've come up with here still needs 2 comparisons and a swap per call for presorted input. There is still a missed opportunity for clustered or (inverse) correlated input -- we can make merging opportunistically skip ahead to determine that the root tape's 100th tuple (say) would still fit in the root position of the merge minheap. So, immediately return 100 tuples from the root's tape without bothering to compare them to anything. Do a binary search to find the best candidate minheap root before the 100th tuple if a guess of 100 doesn't work out. Adapt to trends. Stuff like that. -- Peter Geoghegan
Re: [HACKERS] Parallel tuplesort (for parallel B-Tree index creation)
... On Fri, Sep 9, 2016 at 9:22 PM, Claudio Freire wrote: > Since it is true that doing so would make it impossible to keep the > asserts about tupindex in tuplesort_heap_root_displace, I guess it > depends on how useful those asserts are (ie: how likely it is that > those conditions could be violated, and how damaging it could be if > they were). If it is decided the refactor is desirable, I'd suggest > making the common siftup procedure static inline, to allow > tuplesort_heap_root_displace to inline and specialize it, since it > will be called with checkIndex=False and that simplifies the resulting > code considerably. > > Peter also mentioned that there were some other changes going on in > the surrounding code that could impact this patch, so I'm marking the > patch Waiting on Author. > > Overall, however, I believe the patch is in good shape. Only minor > form issues need to be changed; the functionality seems both desirable > and ready. Sorry, forgot to specify, that was all about patch 3, the one about tuplesort_heap_root_displace.
Re: [HACKERS] Parallel tuplesort (for parallel B-Tree index creation)
On Wed, Sep 7, 2016 at 2:36 PM, Heikki Linnakangas wrote: > 3. If we do that, we'll still have to reserve the tape buffers for all the > tapes that we use during merge. So after we've built the initial runs, we'll > need to reserve memory for those buffers. That might require shrinking > memtuples. But that's OK: after building the initial runs, memtuples is > empty, so we can shrink it. Do you really think all this is worth the effort? Given how things are going to improve for merging anyway, I tend to doubt it. I'd rather just apply the cap (not necessarily 501 tapes, but something), and be done with it. As you know, Knuth never advocated more than 7 tapes at once, which I don't think had anything to do with the economics of tape drives in the 1970s (or problems with tape operators getting repetitive strain injuries). There is a chart in volume 3 about this. Senior hackers talked about a cap like this from day one, back in 2006, when Simon and Tom initially worked on scaling the number of tapes. Alternatively, we could make MERGE_BUFFER_SIZE much larger, which I think would be a good idea independent of whatever waste the logical allocation of never-used tapes presents us with. It's currently 1/4 of 1MiB, which is hardly anything these days, and doesn't seem to have much to do with OS read ahead trigger sizes. If we were going to do something like you describe here, I'd prefer it to be driven by an observable benefit in performance, rather than a theoretical benefit. Not doing everything in one pass isn't necessarily worse than having a less cache efficient heap -- it might be quite a bit better, in fact. You've seen how hard it can be to get a sort that is I/O bound. (Sorting will tend to not be completely I/O bound, unless perhaps parallelism is used). Anyway, this patch (patch 0001-*) is by far the least important of the 3 that you and Claudio are signed up to review. I don't think it's worth bending over backwards to do better.
If you're not comfortable with a simple cap like this, then I'd suggest that we leave it at that, since our time is better spent elsewhere. We can just shelve it for now -- "returned with feedback". I wouldn't make any noise about it (although, I actually don't think that the cap idea is at all controversial). -- Peter Geoghegan