On Fri, Oct 27, 2017 at 12:24 AM, Rushabh Lathia
<rushabh.lat...@gmail.com> wrote:
> While rebasing the parallel B-tree index build patch on top of the
> v22 patch set, I found some cosmetic review comments:
Thanks!

> 1) BufFileSetEstimate is removed but it's still in buffile.h
>
> +extern size_t BufFileSetEstimate(int stripes);

Fixed.

> 2) BufFileSetCreate is renamed to BufFileSetInit, but is still used
> in the comment below:
>
> * Attach to a set of named BufFiles that was created with BufFileSetCreate.

Fixed.

Other minor tweaks: I fixed a harmless warning from Visual C++ and
added a CHECK_FOR_INTERRUPTS() to ExecParallelHashJoinPartitionOuter()'s
loop.

Two other changes:

1.  Improved concurrency for sharedtuplestore.c.

Last week I investigated some test failures on AppVeyor CI and
discovered a small problem with sharedtuplestore.c on Windows: it
could occasionally get ERROR_ACCESS_DENIED errors when attempting to
open files that were concurrently being unlinked (unlinking is not
atomic on Windows; see pgsql-bugs #14243 for another manifestation).
That code was a bit sloppy (though not incorrect), and was easy to fix
by doing some checks in a different order, but...

While hacking on that I realised that sharedtuplestore.c's parallel
scan efficiency was pretty terrible anyway, so I made an improvement
that I'd earlier threatened to make in a comment.  Instead of holding
a per-file lock while reading individual tuples, it now works in
page-multiple-sized chunks.  Readers acquire a spinlock only when they
need a new chunk, don't hold any locks while doing IO, and never read
overlapping pages.

From a user perspective, this means that EXPLAIN (BUFFERS) temporary
buffer read counts are approximately the same as for the equivalent
non-parallel hash join, because each worker reads a disjoint set of
temporary file pages instead of reading interleaved tuples from the
same pages, and there is no longer an LWLock "shared tuplestore" that
can show up in wait_event when backends pile up on the same batch.
It writes very slightly more than it reads because of unread pages at
the end of the final chunk (it reads back tuple-at-a-time and thus
page-at-a-time, not a whole chunk at a time -- I considered reading
whole chunks and then returning pointers to MinimalTuples in the
chunk, but that required MAXALIGNing data in the files on disk, which
made the files noticeably bigger).

So, to summarise, there are now three layers of contention avoidance
strategy used by Parallel Hash Join for scanning batches in parallel:

i)   Participants in a Parallel Hash Join try to work on different
batches, so that they avoid scanning the same SharedTuplestore
completely.  That's visible with EXPLAIN ANALYZE as "Batches Probed"
(which shows the number of outer batches scanned; it doesn't seem
worth the pixels to show "Batches Loaded" for the number of inner
batches scanned, which may be lower).

ii)  When that's not possible, they start out reading from different
backing files, beginning with the one they wrote themselves and then
going around the full set.  That means they don't contend on the
per-file read-head lock until a reader drains a whole file and then
chooses another one that's still being read by someone else.

iii) [New in this version] Since they might still finish up reading
from the same file (and often do towards the end of a join), the
tuples are chopped into multi-page chunks, and participants are
allocated chunks using a spinlock-protected counter.  This is quite a
lot like Parallel Sequential Scan, with some extra complications due
to variable-sized chunks.

2.  Improved oversized tuple handling.

I added a new regression test case to exercise the oversized tuple
support in ExecParallelHashLoadTuple() and sts_puttuple(), as
requested by Andres a while back.  (Thanks to Andrew Gierth for a
pointer on how to get detoasted tuples into a hash join table without
disabling parallelism.)
While testing that, I realised that my defences against some
degenerate behaviour with very small work_mem weren't quite good
enough.  Previously, I adjusted space_allowed to make sure every
backend could allocate at least one memory chunk without triggering an
increase in the number of batches.  Now, I leave space_allowed alone
but explicitly allow every backend to allocate at least one chunk
ignoring the memory budget (whether regular chunk size or oversized
tuple size), to avoid futile repeated batch increases when a single
monster tuple is never going to fit in work_mem.

A couple of stupid things outstanding:

1.  EXPLAIN ANALYZE for Parallel Hash shows the complete row count
under "actual", which is interesting to know (or not?  maybe I should
show it somewhere else?), but the costing shows the partial row
estimate used for costing purposes.

2.  The BufFileSet's temporary directory gets created even if you
don't need it for batches.  Duh.

3.  I don't have a good rescan regression query yet.  I wish I could
write my own query plans to test the executor.

-- 
Thomas Munro
http://www.enterprisedb.com
parallel-hash-v23.patchset.tgz
Description: GNU Zip compressed data
-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers