Re: [HACKERS] Parallel Bitmap Heap Scan - Prefetch pages are not updated properly
On Tue, Apr 4, 2017 at 6:24 AM, Dilip Kumar wrote:
> While analyzing the coverage for the prefetching part, I found an
> issue that prefetch_pages were not updated properly while executing in
> parallel mode.
>
> Attached patch fixes the same.

Wow, OK. Committed.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Parallel bitmap heap scan
On Mon, Mar 27, 2017 at 5:58 PM, Robert Haas wrote:
> Failing to pass DSA_ALLOC_HUGE is an obvious oversight, but I think
> the ptbase == NULL check shouldn't be needed, because we're not
> passing DSA_ALLOC_NO_OOM. And that's good, because this is going to
> be called from SH_CREATE, which isn't prepared for a NULL return
> anyway.
>
> Am I all wet?

Yes, you are right: since we are not passing DSA_ALLOC_NO_OOM, dsa_allocate_extended will raise an error if the allocation fails.

--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com
Re: [HACKERS] Parallel bitmap heap scan
On Mon, Mar 27, 2017 at 5:02 AM, Dilip Kumar wrote:
> On Mon, Mar 27, 2017 at 12:53 PM, Rafia Sabih wrote:
>> Recently, on testing TPC-H 300 scale factor I stumbled on to an error
>> for Q6; the test environment is as follows:
>> work_mem = 1GB
>> shared_buffers = 10GB
>> effective_cache_size = 10GB
>> random_page_cost = seq_page_cost = 0.1
>>
>> The error message is, ERROR: invalid DSA memory alloc request size
>> 1610612740
>> In case required, stack trace is as follows,
>
> Thanks for reporting. In pagetable_allocate, the DSA_ALLOC_HUGE flag
> was missing while allocating the memory from the DSA. I have also
> handled the NULL pointer return from dsa_get_address.
>
> Please verify with the attached patch.

Failing to pass DSA_ALLOC_HUGE is an obvious oversight, but I think the ptbase == NULL check shouldn't be needed, because we're not passing DSA_ALLOC_NO_OOM. And that's good, because this is going to be called from SH_CREATE, which isn't prepared for a NULL return anyway.

Am I all wet?

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Re: [HACKERS] Parallel bitmap heap scan
On Mon, Mar 27, 2017 at 12:53 PM, Rafia Sabih wrote:
> Recently, on testing TPC-H 300 scale factor I stumbled on to an error
> for Q6; the test environment is as follows:
> work_mem = 1GB
> shared_buffers = 10GB
> effective_cache_size = 10GB
> random_page_cost = seq_page_cost = 0.1
>
> The error message is, ERROR: invalid DSA memory alloc request size 1610612740
> In case required, stack trace is as follows,

Thanks for reporting. In pagetable_allocate, the DSA_ALLOC_HUGE flag was missing while allocating the memory from the DSA. I have also handled the NULL pointer return from dsa_get_address.

Please verify with the attached patch.

--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com

bitmap_hugepage_fix.patch
Description: Binary data
Re: [HACKERS] Parallel bitmap heap scan
On Wed, Mar 8, 2017 at 11:52 PM, Robert Haas wrote:
> On Wed, Mar 8, 2017 at 1:18 PM, Tom Lane wrote:
>> Robert Haas writes:
>>> What I'm using is:
>>>
>>> Configured with: --prefix=/Applications/Xcode.app/Contents/Developer/usr
>>> --with-gxx-include-dir=/usr/include/c++/4.2.1
>>> Apple LLVM version 7.0.2 (clang-700.1.81)
>>> Target: x86_64-apple-darwin14.5.0
>>> Thread model: posix
>>
>> Hm. I noticed that longfin didn't spit up on it either, despite having
>> -Werror turned on. That's a slightly newer version, but still Apple's
>> clang:
>>
>> $ gcc -v
>> Configured with: --prefix=/Applications/Xcode.app/Contents/Developer/usr
>> --with-gxx-include-dir=/usr/include/c++/4.2.1
>> Apple LLVM version 8.0.0 (clang-800.0.42.1)
>> Target: x86_64-apple-darwin16.4.0
>> Thread model: posix
>> InstalledDir: /Applications/Xcode.app/Contents/Developer/Toolchains/XcodeDefault.xctoolchain/usr/bin
>
> Yeah. I think on my previous MacBook Pro, you could do this without
> generating a warning:
>
> int x;
> printf("%d\n", x);
>
> The compiler on this one detects that case, but that seems to be about
> as far as it goes.
Recently, on testing TPC-H 300 scale factor I stumbled on to an error for Q6; the test environment is as follows:
work_mem = 1GB
shared_buffers = 10GB
effective_cache_size = 10GB
random_page_cost = seq_page_cost = 0.1

The error message is, ERROR: invalid DSA memory alloc request size 1610612740
In case required, the stack trace is as follows:

#0  errfinish (dummy=0) at elog.c:415
#1  0x10738820 in elog_finish (elevel=20, fmt=0x109115d0 "invalid DSA memory alloc request size %zu") at elog.c:1378
#2  0x10778824 in dsa_allocate_extended (area=0x1001b53b138, size=1610612740, flags=4) at dsa.c:676
#3  0x103a3e60 in pagetable_allocate (pagetable=0x1001b52a590, size=1610612736) at tidbitmap.c:1521
#4  0x103a0488 in pagetable_grow (tb=0x1001b52a590, newsize=33554432) at ../../../src/include/lib/simplehash.h:379
#5  0x103a07b0 in pagetable_insert (tb=0x1001b52a590, key=34962730, found=0x3fffc705ba48 "") at ../../../src/include/lib/simplehash.h:504
#6  0x103a3354 in tbm_get_pageentry (tbm=0x1001b526fa0, pageno=34962730) at tidbitmap.c:1235
#7  0x103a1614 in tbm_add_tuples (tbm=0x1001b526fa0, tids=0x1001b51d22a, ntids=1, recheck=0 '\000') at tidbitmap.c:425
#8  0x100fdf24 in btgetbitmap (scan=0x1001b51c4a0, tbm=0x1001b526fa0) at nbtree.c:460
#9  0x100f2c48 in index_getbitmap (scan=0x1001b51c4a0, bitmap=0x1001b526fa0) at indexam.c:726
#10 0x1033c7d8 in MultiExecBitmapIndexScan (node=0x1001b51c180) at nodeBitmapIndexscan.c:91
#11 0x1031c0b4 in MultiExecProcNode (node=0x1001b51c180) at execProcnode.c:611
#12 0x1033a8a8 in BitmapHeapNext (node=0x1001b5187a8) at nodeBitmapHeapscan.c:143
#13 0x1032a094 in ExecScanFetch (node=0x1001b5187a8, accessMtd=0x1033a6c8, recheckMtd=0x1033bab8) at execScan.c:95
#14 0x1032a194 in ExecScan (node=0x1001b5187a8, accessMtd=0x1033a6c8, recheckMtd=0x1033bab8) at execScan.c:162
#15 0x1033bb88 in ExecBitmapHeapScan (node=0x1001b5187a8) at nodeBitmapHeapscan.c:667

Please feel free to ask if any more information is required for this.
--
Regards,
Rafia Sabih
EnterpriseDB: http://www.enterprisedb.com/
Re: [HACKERS] Parallel bitmap heap scan
On Wed, Mar 8, 2017 at 1:18 PM, Tom Lane wrote:
> Robert Haas writes:
>> What I'm using is:
>>
>> Configured with: --prefix=/Applications/Xcode.app/Contents/Developer/usr
>> --with-gxx-include-dir=/usr/include/c++/4.2.1
>> Apple LLVM version 7.0.2 (clang-700.1.81)
>> Target: x86_64-apple-darwin14.5.0
>> Thread model: posix
>
> Hm. I noticed that longfin didn't spit up on it either, despite having
> -Werror turned on. That's a slightly newer version, but still Apple's
> clang:
>
> $ gcc -v
> Configured with: --prefix=/Applications/Xcode.app/Contents/Developer/usr
> --with-gxx-include-dir=/usr/include/c++/4.2.1
> Apple LLVM version 8.0.0 (clang-800.0.42.1)
> Target: x86_64-apple-darwin16.4.0
> Thread model: posix
> InstalledDir: /Applications/Xcode.app/Contents/Developer/Toolchains/XcodeDefault.xctoolchain/usr/bin

Yeah. I think on my previous MacBook Pro, you could do this without generating a warning:

int x;
printf("%d\n", x);

The compiler on this one detects that case, but that seems to be about as far as it goes.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Re: [HACKERS] Parallel bitmap heap scan
Robert Haas writes:
> What I'm using is:
>
> Configured with: --prefix=/Applications/Xcode.app/Contents/Developer/usr
> --with-gxx-include-dir=/usr/include/c++/4.2.1
> Apple LLVM version 7.0.2 (clang-700.1.81)
> Target: x86_64-apple-darwin14.5.0
> Thread model: posix

Hm. I noticed that longfin didn't spit up on it either, despite having -Werror turned on. That's a slightly newer version, but still Apple's clang:

$ gcc -v
Configured with: --prefix=/Applications/Xcode.app/Contents/Developer/usr
--with-gxx-include-dir=/usr/include/c++/4.2.1
Apple LLVM version 8.0.0 (clang-800.0.42.1)
Target: x86_64-apple-darwin16.4.0
Thread model: posix
InstalledDir: /Applications/Xcode.app/Contents/Developer/Toolchains/XcodeDefault.xctoolchain/usr/bin

regards, tom lane
Re: [HACKERS] Parallel bitmap heap scan
On Wed, Mar 8, 2017 at 12:53 PM, Tom Lane wrote:
>> Thanks. Sorry for the hassle; my compiler isn't as picky about this
>> as I would like, and apparently Dilip's isn't either.
>
> Might be interesting to see whether -O level affects it. In principle,
> whether you get the warning should depend on how much the compiler has
> analyzed the logic flow ...

What I'm using is:

Configured with: --prefix=/Applications/Xcode.app/Contents/Developer/usr
--with-gxx-include-dir=/usr/include/c++/4.2.1
Apple LLVM version 7.0.2 (clang-700.1.81)
Target: x86_64-apple-darwin14.5.0
Thread model: posix

While I haven't experimented with this too extensively, my general impression is that this thing is extremely tolerant of uninitialized variables. I just tried compiling nodeBitmapHeapscan.c with -Wall -Werror and each of -O0, -O1, -O2, and -O3, and none of those produced any warnings. I've been reluctant to go to the hassle of installing a different compiler...

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Re: [HACKERS] Parallel bitmap heap scan
Robert Haas writes:
> On Wed, Mar 8, 2017 at 12:45 PM, Tom Lane wrote:
>> Jeff Janes writes:
>>> I'm getting some compiler warnings in gcc version 4.4.7 20120313 (Red Hat
>>> 4.4.7-17) (GCC):
>>
>> Me too. Fix pushed.
>
> Thanks. Sorry for the hassle; my compiler isn't as picky about this
> as I would like, and apparently Dilip's isn't either.

Might be interesting to see whether -O level affects it. In principle, whether you get the warning should depend on how much the compiler has analyzed the logic flow ...

regards, tom lane
Re: [HACKERS] Parallel bitmap heap scan
On Wed, Mar 8, 2017 at 12:45 PM, Tom Lane wrote:
> Jeff Janes writes:
>> I'm getting some compiler warnings in gcc version 4.4.7 20120313 (Red Hat
>> 4.4.7-17) (GCC):
>
> Me too. Fix pushed.

Thanks. Sorry for the hassle; my compiler isn't as picky about this as I would like, and apparently Dilip's isn't either.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Re: [HACKERS] Parallel bitmap heap scan
Jeff Janes writes:
> I'm getting some compiler warnings in gcc version 4.4.7 20120313 (Red Hat
> 4.4.7-17) (GCC):

Me too. Fix pushed.

regards, tom lane
Re: [HACKERS] Parallel bitmap heap scan
On Wed, Mar 8, 2017 at 9:08 AM, Robert Haas wrote:
> On Wed, Mar 8, 2017 at 11:20 AM, Dilip Kumar wrote:
>> Right, done that way
>
> This didn't compile because you bobbled some code in
> src/backend/nodes, but it was a trivial mistake so I fixed it.
>
> Committed with that fix and a bunch of minor cosmetic changes.

I'm getting some compiler warnings in gcc version 4.4.7 20120313 (Red Hat 4.4.7-17) (GCC):

nodeBitmapHeapscan.c: In function 'BitmapHeapNext':
nodeBitmapHeapscan.c:79: warning: 'tbmiterator' may be used uninitialized in this function
nodeBitmapHeapscan.c:80: warning: 'shared_tbmiterator' may be used uninitialized in this function

Cheers,
Jeff
Re: [HACKERS] Parallel bitmap heap scan
On Wed, Mar 8, 2017 at 11:20 AM, Dilip Kumar wrote:
> Right, done that way

This didn't compile because you bobbled some code in src/backend/nodes, but it was a trivial mistake so I fixed it.

Committed with that fix and a bunch of minor cosmetic changes.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Re: [HACKERS] Parallel bitmap heap scan
On Wed, Mar 8, 2017 at 8:28 PM, Robert Haas wrote:
> How about adding a regression test?

Added

> bitmap_subplan_mark_shared could use castNode(), which seems like it
> would be better style. Maybe some other places, too.
>
> + ParallelBitmapPopulate
> + Waiting for the leader to populate the TidBitmap.
> +
> +
>
> If you build the documentation, you'll find that this doesn't come out
> right; you need to add 1 to the value of the nearest preceding
> "morerows". (I fixed a similar issue with 0001 while committing.)

Fixed

> +/*---
> + * Check the current state
> + * If state is
> + *   BM_INITIAL    : We become the leader and set it to BM_INPROGRESS
> + *   BM_INPROGRESS : We need to wait till leader creates bitmap
> + *   BM_FINISHED   : bitmap is ready so no need to wait
> + *---
>
> The formatting of this comment is slightly off - the comment for
> BM_INITIAL isn't aligned the same as the others. But I would just
> delete the whole comment, since more or less it recapitulates the
> function header comment anyway.

Removed.

> I wonder if BitmapShouldInitializeSharedState couldn't be written a
> little more compactly overall, like this:
>
> {
>     SharedBitmapState state;
>
>     while (1)
>     {
>         SpinLockAcquire(&pstate->mutex);
>         state = pstate->state;
>         if (pstate->state == BM_INITIAL)
>             pstate->state = BM_INPROGRESS;
>         SpinLockRelease(&pstate->mutex);
>
>         /* If we are leader or leader has already created a TIDBITMAP */
>         if (state != BM_INPROGRESS)
>             break;
>
>         /* Sleep until leader finishes creating the bitmap */
>         ConditionVariableSleep(&pstate->cv, WAIT_EVENT_PARALLEL_BITMAP_SCAN);
>     }
>
>     ConditionVariableCancelSleep();
>
>     return (state == BM_INITIAL);
> }

This looks good, done this way

> +/*
> + * By this time we have already populated the TBM and
> + * initialized the shared iterators so set the state to
> + * BM_FINISHED and wake up others.
> + */
> +SpinLockAcquire(&pstate->mutex);
> +pstate->state = BM_FINISHED;
> +SpinLockRelease(&pstate->mutex);
> +ConditionVariableBroadcast(&pstate->cv);
>
> I think it would be good to have a function for this, like
> BitmapDoneInitializingSharedState(), and just call that function here.

Done

> +SpinLockAcquire(&pstate->mutex);
> +
> +/*
> + * Recheck under the mutex, If some other process has already done
> + * the enough prefetch then we need not to do anything.
> + */
> +if (pstate->prefetch_pages >= pstate->prefetch_target)
> +{
> +    SpinLockRelease(&pstate->mutex);
> +    return;
> +}
> +SpinLockRelease(&pstate->mutex);
>
> I think it would be clearer to write this as:
>
> SpinLockAcquire(&pstate->mutex);
> do_prefetch = (pstate->prefetch_pages < pstate->prefetch_target);
> SpinLockRelease(&pstate->mutex);
> if (!do_prefetch)
>     return;
>
> Then it's more obvious what's going on with the spinlock. But
> actually what I would do is also roll in the increment to prefetch
> pages in there, so that you don't have to reacquire the spinlock after
> calling PrefetchBuffer:
>
> bool do_prefetch = false;
> SpinLockAcquire(&pstate->mutex);
> if (pstate->prefetch_pages < pstate->prefetch_target)
> {
>     pstate->prefetch_pages++;
>     do_prefetch = true;
> }
> SpinLockRelease(&pstate->mutex);
>
> That seems like it will reduce the amount of excess prefetching
> considerably, and also simplify the code and cut the spinlock
> acquisitions by 50%.

Right, done that way

> Overall I think this is in pretty good shape.

Thanks.

--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com

0003-parallel-bitmap-heapscan-v9.patch
Description: Binary data
Re: [HACKERS] Parallel bitmap heap scan
Reviewing 0003:

How about adding a regression test?

bitmap_subplan_mark_shared could use castNode(), which seems like it would be better style. Maybe some other places, too.

+ ParallelBitmapPopulate
+ Waiting for the leader to populate the TidBitmap.
+
+

If you build the documentation, you'll find that this doesn't come out right; you need to add 1 to the value of the nearest preceding "morerows". (I fixed a similar issue with 0001 while committing.)

+/*---
+ * Check the current state
+ * If state is
+ *   BM_INITIAL    : We become the leader and set it to BM_INPROGRESS
+ *   BM_INPROGRESS : We need to wait till leader creates bitmap
+ *   BM_FINISHED   : bitmap is ready so no need to wait
+ *---

The formatting of this comment is slightly off - the comment for BM_INITIAL isn't aligned the same as the others. But I would just delete the whole comment, since more or less it recapitulates the function header comment anyway.

I wonder if BitmapShouldInitializeSharedState couldn't be written a little more compactly overall, like this:

{
    SharedBitmapState state;

    while (1)
    {
        SpinLockAcquire(&pstate->mutex);
        state = pstate->state;
        if (pstate->state == BM_INITIAL)
            pstate->state = BM_INPROGRESS;
        SpinLockRelease(&pstate->mutex);

        /* If we are leader or leader has already created a TIDBITMAP */
        if (state != BM_INPROGRESS)
            break;

        /* Sleep until leader finishes creating the bitmap */
        ConditionVariableSleep(&pstate->cv, WAIT_EVENT_PARALLEL_BITMAP_SCAN);
    }

    ConditionVariableCancelSleep();

    return (state == BM_INITIAL);
}

+/*
+ * By this time we have already populated the TBM and
+ * initialized the shared iterators so set the state to
+ * BM_FINISHED and wake up others.
+ */
+SpinLockAcquire(&pstate->mutex);
+pstate->state = BM_FINISHED;
+SpinLockRelease(&pstate->mutex);
+ConditionVariableBroadcast(&pstate->cv);

I think it would be good to have a function for this, like BitmapDoneInitializingSharedState(), and just call that function here.

+SpinLockAcquire(&pstate->mutex);
+
+/*
+ * Recheck under the mutex, If some other process has already done
+ * the enough prefetch then we need not to do anything.
+ */
+if (pstate->prefetch_pages >= pstate->prefetch_target)
+{
+    SpinLockRelease(&pstate->mutex);
+    return;
+}
+SpinLockRelease(&pstate->mutex);

I think it would be clearer to write this as:

SpinLockAcquire(&pstate->mutex);
do_prefetch = (pstate->prefetch_pages < pstate->prefetch_target);
SpinLockRelease(&pstate->mutex);
if (!do_prefetch)
    return;

Then it's more obvious what's going on with the spinlock. But actually what I would do is also roll in the increment to prefetch pages in there, so that you don't have to reacquire the spinlock after calling PrefetchBuffer:

bool do_prefetch = false;
SpinLockAcquire(&pstate->mutex);
if (pstate->prefetch_pages < pstate->prefetch_target)
{
    pstate->prefetch_pages++;
    do_prefetch = true;
}
SpinLockRelease(&pstate->mutex);

That seems like it will reduce the amount of excess prefetching considerably, and also simplify the code and cut the spinlock acquisitions by 50%.

Overall I think this is in pretty good shape.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Re: [HACKERS] Parallel bitmap heap scan
On Wed, Mar 8, 2017 at 6:42 PM, Robert Haas wrote:
> I don't think I understand exactly why this system might be prone to a
> little bit of extra prefetching - can you explain further?

Let me explain with an example. Suppose there are 2 workers prefetching jointly and prefetch_target is 3. When both workers have prefetched 1 page each, prefetch_pages will be 2, so both workers will go for the next prefetch because prefetch_pages is still less than prefetch_target. Both workers will prefetch again, and in total they will prefetch 4 pages.

> I don't think it will hurt anything as long as we are talking about a small
> number of extra prefetches here and there.

I completely agree with this point; I mentioned it in the mail only so that it doesn't go unnoticed. And whatever extra prefetching we have done will be used anyway, unless we stop the execution because of some limit clause.

>> Fixed
>
> Committed 0001.

Thanks

--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com
Re: [HACKERS] Parallel bitmap heap scan
On Wed, Mar 8, 2017 at 1:49 AM, Dilip Kumar wrote:
> On Tue, Mar 7, 2017 at 10:10 PM, Dilip Kumar wrote:
>>> It's not about speed. It's about not forgetting to prefetch. Suppose
>>> that worker 1 becomes the prefetch worker but then doesn't return to
>>> the Bitmap Heap Scan node for a long time because it's busy in some
>>> other part of the plan tree. Now you just stop prefetching; that's
>>> bad. You want prefetching to continue regardless of which workers are
>>> busy doing what; as long as SOME worker is executing the parallel
>>> bitmap heap scan, prefetching should continue as needed.
>>
>> Right, I missed this part. I will fix this.
>
> I have fixed this part, after doing that I realised if multiple
> processes are prefetching then it may be possible that in boundary
> cases (e.g. suppose prefetch_target is 3 and prefetch_pages is at 2)
> there may be some extra prefetch but finally those prefetched blocks
> will be used. Another, solution to this problem is that we can
> increase the prefetch_pages in advance then call tbm_shared_iterate,
> this will avoid extra prefetch. But I am not sure what will be best
> here.

I don't think I understand exactly why this system might be prone to a little bit of extra prefetching - can you explain further? I don't think it will hurt anything as long as we are talking about a small number of extra prefetches here and there.

> Fixed

Committed 0001.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Re: [HACKERS] Parallel bitmap heap scan
On Tue, Mar 7, 2017 at 10:10 PM, Dilip Kumar wrote:
>> It's not about speed. It's about not forgetting to prefetch. Suppose
>> that worker 1 becomes the prefetch worker but then doesn't return to
>> the Bitmap Heap Scan node for a long time because it's busy in some
>> other part of the plan tree. Now you just stop prefetching; that's
>> bad. You want prefetching to continue regardless of which workers are
>> busy doing what; as long as SOME worker is executing the parallel
>> bitmap heap scan, prefetching should continue as needed.
>
> Right, I missed this part. I will fix this.

I have fixed this part. After doing that, I realised that if multiple processes are prefetching then it may be possible that in boundary cases (e.g. suppose prefetch_target is 3 and prefetch_pages is at 2) there may be some extra prefetching, but finally those prefetched blocks will be used. Another solution to this problem is that we can increase prefetch_pages in advance and then call tbm_shared_iterate; this will avoid the extra prefetching. But I am not sure what will be best here.

On Tue, Mar 7, 2017 at 9:41 PM, Robert Haas wrote:
> +if (--ptbase->refcount == 0)
> +    dsa_free(dsa, istate->pagetable);
> +
> +if (istate->spages)
> +{
> +    ptpages = dsa_get_address(dsa, istate->spages);
> +    if (--ptpages->refcount == 0)
> +        dsa_free(dsa, istate->spages);
> +}
> +if (istate->schunks)
> +{
> +    ptchunks = dsa_get_address(dsa, istate->schunks);
> +    if (--ptchunks->refcount == 0)
> +        dsa_free(dsa, istate->schunks);
> +}
>
> This doesn't involve any locking, which I think will happen to work
> with the current usage pattern but doesn't seem very robust in
> general. I think you either need the refcounts to be protected by a
> spinlock, or maybe better, use pg_atomic_uint32 for them. You want
> something like:
>
> if (pg_atomic_sub_fetch_u32(&refcount, 1) == 0)
> {
>     dsa_free(...)
> }
>
> Otherwise, there's no guarantee it will get freed exactly once.

Fixed

On Tue, Mar 7, 2017 at 9:34 PM, Robert Haas wrote:
> You've still got this:
>
> + if (DsaPointerIsValid(node->pstate->tbmiterator))
> +     tbm_free_shared_area(dsa, node->pstate->tbmiterator);
> +
> + if (DsaPointerIsValid(node->pstate->prefetch_iterator))
> +     dsa_free(dsa, node->pstate->prefetch_iterator);
>
> I'm trying to get to a point where both calls use
> tbm_free_shared_area() - i.e. no peeking behind the abstraction layer.

Fixed

--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com

0001-tidbitmap-support-shared-v8.patch
Description: Binary data
0003-parallel-bitmap-heapscan-v8.patch
Description: Binary data
Re: [HACKERS] Parallel bitmap heap scan
On Tue, Mar 7, 2017 at 10:07 PM, Robert Haas wrote:
> It's not about speed. It's about not forgetting to prefetch. Suppose
> that worker 1 becomes the prefetch worker but then doesn't return to
> the Bitmap Heap Scan node for a long time because it's busy in some
> other part of the plan tree. Now you just stop prefetching; that's
> bad. You want prefetching to continue regardless of which workers are
> busy doing what; as long as SOME worker is executing the parallel
> bitmap heap scan, prefetching should continue as needed.

Right, I missed this part. I will fix this.

--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com
Re: [HACKERS] Parallel bitmap heap scan
On Tue, Mar 7, 2017 at 11:27 AM, Dilip Kumar wrote:
> On Tue, Mar 7, 2017 at 9:44 PM, Robert Haas wrote:
>> I mean, IIUC, the call to PrefetchBuffer() is not done under any lock.
>> And that's the slow part. The tiny amount of time we spend updating
>> the prefetch information under the mutex should be insignificant
>> compared to the cost of actually reading the buffer. Unless I'm
>> missing something.
>
> Okay, but IIUC, PrefetchBuffer is an async call to load the buffer
> if it's not already in shared buffers. So if, instead of one process
> making multiple async calls to PrefetchBuffer, we make them from
> multiple processes, will it be any faster? Or are you thinking that
> at least we can make the BufTableLookup call parallel, because that
> is not an async call?

It's not about speed. It's about not forgetting to prefetch. Suppose that worker 1 becomes the prefetch worker but then doesn't return to the Bitmap Heap Scan node for a long time because it's busy in some other part of the plan tree. Now you just stop prefetching; that's bad. You want prefetching to continue regardless of which workers are busy doing what; as long as SOME worker is executing the parallel bitmap heap scan, prefetching should continue as needed.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Re: [HACKERS] Parallel bitmap heap scan
On Tue, Mar 7, 2017 at 9:44 PM, Robert Haas wrote:
> I mean, IIUC, the call to PrefetchBuffer() is not done under any lock.
> And that's the slow part. The tiny amount of time we spend updating
> the prefetch information under the mutex should be insignificant
> compared to the cost of actually reading the buffer. Unless I'm
> missing something.

Okay, but IIUC, PrefetchBuffer is an async call to load the buffer if it's not already in shared buffers. So if, instead of one process making multiple async calls to PrefetchBuffer, we make them from multiple processes, will it be any faster? Or are you thinking that at least we can make the BufTableLookup call parallel, because that is not an async call?

--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com
Re: [HACKERS] Parallel bitmap heap scan
On Tue, Mar 7, 2017 at 11:09 AM, Dilip Kumar wrote:
> On Tue, Mar 7, 2017 at 9:34 PM, Robert Haas wrote:
>>
>> + if (DsaPointerIsValid(node->pstate->tbmiterator))
>> +     tbm_free_shared_area(dsa, node->pstate->tbmiterator);
>> +
>> + if (DsaPointerIsValid(node->pstate->prefetch_iterator))
>> +     dsa_free(dsa, node->pstate->prefetch_iterator);
>
> As per latest code, both should be calling to tbm_free_shared_area
> because tbm_free_shared_area is capable of handling that. My silly
> mistake. I will fix it.

Thanks.

I don't think I believe this rationale:

+ /*
+  * If one of the process has already identified that we need
+  * to do prefetch then let it perform the prefetch and allow
+  * others to proceed with the work in hand. Another option
+  * could be that we allow all of them to participate in
+  * prefetching. But, most of this work done under mutex or
+  * LWLock so ultimately we may end up in prefetching
+  * sequentially.
+  */

I mean, IIUC, the call to PrefetchBuffer() is not done under any lock. And that's the slow part. The tiny amount of time we spend updating the prefetch information under the mutex should be insignificant compared to the cost of actually reading the buffer. Unless I'm missing something.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Re: [HACKERS] Parallel bitmap heap scan
On Tue, Feb 28, 2017 at 10:48 AM, Dilip Kumar wrote:
> 0001- same as previous with some changes for freeing the shared memory stuff.

+if (--ptbase->refcount == 0)
+    dsa_free(dsa, istate->pagetable);
+
+if (istate->spages)
+{
+    ptpages = dsa_get_address(dsa, istate->spages);
+    if (--ptpages->refcount == 0)
+        dsa_free(dsa, istate->spages);
+}
+if (istate->schunks)
+{
+    ptchunks = dsa_get_address(dsa, istate->schunks);
+    if (--ptchunks->refcount == 0)
+        dsa_free(dsa, istate->schunks);
+}

This doesn't involve any locking, which I think will happen to work with the current usage pattern but doesn't seem very robust in general. I think you either need the refcounts to be protected by a spinlock, or maybe better, use pg_atomic_uint32 for them. You want something like:

if (pg_atomic_sub_fetch_u32(&refcount, 1) == 0)
{
    dsa_free(...)
}

Otherwise, there's no guarantee it will get freed exactly once.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Re: [HACKERS] Parallel bitmap heap scan
On Tue, Mar 7, 2017 at 9:34 PM, Robert Haas wrote:
>
> + if (DsaPointerIsValid(node->pstate->tbmiterator))
> + tbm_free_shared_area(dsa, node->pstate->tbmiterator);
> +
> + if (DsaPointerIsValid(node->pstate->prefetch_iterator))
> + dsa_free(dsa, node->pstate->prefetch_iterator);

As per the latest code, both should be calling tbm_free_shared_area,
because tbm_free_shared_area is capable of handling that. My silly
mistake. I will fix it.

--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com
Re: [HACKERS] Parallel bitmap heap scan
On Mon, Mar 6, 2017 at 12:35 AM, Dilip Kumar wrote:
> On Thu, Mar 2, 2017 at 6:52 PM, Robert Haas wrote:
>> 0002 wasn't quite careful enough about the placement of #ifdef
>> USE_PREFETCH, but otherwise looks OK. Committed after changing that
>> and getting rid of the local variable prefetch_iterator, which seemed
>> to be adding rather than removing complexity after this refactoring.
>
> 0003 is rebased after this commit.

You've still got this:

+ if (DsaPointerIsValid(node->pstate->tbmiterator))
+ tbm_free_shared_area(dsa, node->pstate->tbmiterator);
+
+ if (DsaPointerIsValid(node->pstate->prefetch_iterator))
+ dsa_free(dsa, node->pstate->prefetch_iterator);

I'm trying to get to a point where both calls use
tbm_free_shared_area() - i.e. no peeking behind the abstraction layer.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Re: [HACKERS] Parallel bitmap heap scan
On Thu, Mar 2, 2017 at 6:52 PM, Robert Haas wrote:
> 0002 wasn't quite careful enough about the placement of #ifdef
> USE_PREFETCH, but otherwise looks OK. Committed after changing that
> and getting rid of the local variable prefetch_iterator, which seemed
> to be adding rather than removing complexity after this refactoring.

0003 is rebased after this commit.

--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com

0003-parallel-bitmap-heapscan-v7.patch
Description: Binary data
Re: [HACKERS] Parallel bitmap heap scan
On Tue, Feb 28, 2017 at 9:18 PM, Dilip Kumar wrote:
> 0001- same as previous with some changes for freeing the shared memory stuff.
> 0002- nodeBitmapHeapScan refactoring, this applies independently
> 0003- actual parallel bitmap stuff, applies on top of 0001 and 0002

0002 wasn't quite careful enough about the placement of #ifdef
USE_PREFETCH, but otherwise looks OK. Committed after changing that
and getting rid of the local variable prefetch_iterator, which seemed
to be adding rather than removing complexity after this refactoring.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Re: [HACKERS] Parallel bitmap heap scan
On Sun, Feb 26, 2017 at 9:26 PM, Robert Haas wrote:
> I'm not entirely sure about the calling convention for
> tbm_free_shared_area() but the rest seems OK.

Changed

>> 2. Now tbm_free is not freeing any of the shared members which can be
>> accessed by other workers, so tbm_free is safe to call from
>> ExecEndBitmapHeapScan without any safety check or ref count.
>
> That also seems fine. We ended up with something very similar in the
> Parallel Index Scan patch.
>
>> 0002:
>> 1. We don't need ExecShutdownBitmapHeapScan anymore because now we are
>> not freeing any shared member in ExecEndBitmapHeapScan.
>> 2. In ExecReScanBitmapHeapScan we will call tbm_free_shared_area to
>> free the shared members of the TBM.
>> 3. After that, we will free the TBMSharedIteratorState we allocated
>> using tbm_prepare_shared_iterate.
>
> Check. But I think tbm_free_shared_area() should also free the object
> itself, instead of making the caller do that separately.

Right, done that way.

> + if (DsaPointerIsValid(node->pstate->tbmiterator))
> + {
> + /* First we free the shared TBM members using the shared state */
> + tbm_free_shared_area(dsa, node->pstate->tbmiterator);
> + dsa_free(dsa, node->pstate->tbmiterator);
> + }
> +
> + if (DsaPointerIsValid(node->pstate->prefetch_iterator))
> + dsa_free(dsa, node->pstate->prefetch_iterator);
>
> The fact that these cases aren't symmetric suggests that your
> abstraction is leaky. I'm guessing that you can't call
> tbm_free_shared_area because the two iterators share one copy of the
> underlying iteration arrays, and the TBM code isn't smart enough to
> avoid freeing them twice. You're going to have to come up with a
> better solution to that problem; nodeBitmapHeapScan.c shouldn't know
> about the way the underlying storage details are managed. (Maybe you
> need to reference-count the iterator arrays?)

Converted the iterator arrays to a structure and maintained the
refcount. I had to do the same thing for the pagetable also, because
that is also shared across iterators.

> + if (node->inited)
> + goto start_iterate;
>
> My first programming teacher told me not to use goto. I've
> occasionally violated that rule, but I need a better reason than you
> have here. It looks very easy to avoid.

Changed

> + pbms_set_parallel(outerPlanState(node));
>
> I think this should be a flag in the plan, and the planner should set
> it correctly, instead of having it be a flag in the executor that the
> executor sets. Also, the flag itself should really be called
> something that involves the word "shared" rather than "parallel",
> because the bitmap will not be created in parallel, but it will be
> shared.

Done

> Have you checked whether this patch causes any regression in the
> non-parallel case? It adds a bunch of "if" statements, so it might.
> Hopefully not, but it needs to be carefully tested.

With the new patch I tested on my local machine, performed multiple
executions, and it doesn't show any regression. The attached file
perf_result contains the explain analyze output for one of the queries
(head, patch with 0 workers, and patch with 2 workers). I will perform
the test with all the TPC-H queries that use a bitmap plan on the
bigger machine and will post the results next week.

> @@ -48,10 +48,11 @@
> #include "utils/snapmgr.h"
> #include "utils/tqual.h"
>
> -
> static TupleTableSlot *BitmapHeapNext(BitmapHeapScanState *node);
>
> Unnecessary.

Fixed

> +static bool pbms_is_leader(ParallelBitmapState *pbminfo);
> +static void pbms_set_parallel(PlanState *node);
>
> I don't think this "pbms" terminology is very good. It's dissimilar
> to the way other functions in this file are named. Maybe
> BitmapShouldInitializeSharedState().

Changed

> I think that some of the bits that this function makes conditional on
> pstate should be factored out into inline functions. Like:
>
> - if (node->prefetch_target >= node->prefetch_maximum)
> -  /* don't increase any further */ ;
> - else if (node->prefetch_target >= node->prefetch_maximum / 2)
> - node->prefetch_target = node->prefetch_maximum;
> - else if (node->prefetch_target > 0)
> - node->prefetch_target *= 2;
> - else
> - node->prefetch_target++;
> + if (!pstate)
> + {
> + if (node->prefetch_target >= node->prefetch_maximum)
> +  /* don't increase any further */ ;
> + else if (node->prefetch_target >= node->prefetch_maximum / 2)
> + node->prefetch_target = node->prefetch_maximum;
> + else if (node->prefetch_target > 0)
> + node->prefetch_target *= 2;
> + else
> + node->prefetch_target++;
> + }
> + else if (pstate->prefetch_target < node->prefetch_maximum)
> +
Re: [HACKERS] Parallel bitmap heap scan
On Mon, Feb 27, 2017 at 10:30 AM, Dilip Kumar wrote:
> On Sun, Feb 26, 2017 at 9:26 PM, Robert Haas wrote:
>> tbm_free_shared_area because the two iterators share one copy of the
>> underlying iteration arrays, and the TBM code isn't smart enough to
>> avoid freeing them twice. You're going to have to come up with a
>> better solution to that problem; nodeBitmapHeapScan.c shouldn't know
>> about the way the underlying storage details are managed. (Maybe you
>> need to reference-count the iterator arrays?)
>
> Yeah, I also think the current way doesn't look so clean; currently,
> these arrays are just integer arrays. Maybe we can use the first slot
> of the array for the reference count? Or convert to a structure which
> has space for a reference count and an integer array. What do you
> suggest?

Maybe something like

typedef struct
{
    int refcnt;
    SomeType somename[FLEXIBLE_ARRAY_MEMBER];
} SomeOtherType;

would be a good approach.

>> + if (node->inited)
>> + goto start_iterate;
>>
>> My first programming teacher told me not to use goto. I've
>> occasionally violated that rule, but I need a better reason than you
>> have here. It looks very easy to avoid.
>
> Yes, this can be avoided; I was just trying to get rid of multi-level
> if nesting and ended up with the "goto".

That's what I figured.

>> + pbms_set_parallel(outerPlanState(node));
>>
>> I think this should be a flag in the plan, and the planner should set
>> it correctly, instead of having it be a flag in the executor that the
>> executor sets. Also, the flag itself should really be called
>> something that involves the word "shared" rather than "parallel",
>> because the bitmap will not be created in parallel, but it will be
>> shared.
>
> Earlier, I thought that it would not be a good idea to set that flag in
> the BitmapIndexScan path because the same path can be used either under
> ParallelBitmapHeapPath or under a normal BitmapHeapPath. But, now after
> putting some thought into it, I realised that we can do that in
> create_plan.
> Therefore, I will change this.

Cool.

>> pbms_is_leader() looks horrifically inefficient. Every worker will
>> reacquire the spinlock for every tuple. You should only have to enter
>> this spinlock-acquiring loop for the first tuple. After that, either
>> you became the leader, did the initialization, and set the state to
>> PBM_FINISHED, or you waited until the pre-existing leader did the
>> same. You should have a backend-local flag that keeps you from
>> touching the spinlock for every tuple. I wouldn't be surprised if
>> fixing this nets a noticeable performance increase for this patch at
>> high worker counts.
>
> I think there is some confusion: if you notice, the below code will
> avoid calling pbms_is_leader for every tuple. It will be called only
> the first time. After that, node->inited will be true and it will
> directly jump to start_iterate for subsequent calls. Am I missing
> something?
>
>> + if (node->inited)
>> + goto start_iterate;

Oh, OK. I guess I was just confused, then.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
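[Editor's note: the FLEXIBLE_ARRAY_MEMBER sketch quoted above could be fleshed out roughly as follows. This is not the committed PostgreSQL code: the names are invented, plain malloc()/free() stand in for dsa_allocate()/dsa_free(), and no locking is shown — as the review elsewhere in this thread points out, a count shared between backends would need to be a pg_atomic_uint32 or spinlock-protected.]

```c
/*
 * Sketch: wrap the bare iteration array in a struct whose first field
 * is a reference count, so both iterators can share one allocation and
 * whoever releases last frees it.
 */
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

typedef struct PagetableEntryRef
{
    int         refcount;   /* how many iterators share this array */
    int         entries[];  /* C99 flexible array member */
} PagetableEntryRef;

static PagetableEntryRef *
pagetable_array_create(int nentries)
{
    PagetableEntryRef *p;

    /* one allocation covers the header and the array */
    p = malloc(offsetof(PagetableEntryRef, entries) +
               nentries * sizeof(int));
    p->refcount = 1;        /* the creator holds the first reference */
    memset(p->entries, 0, nentries * sizeof(int));
    return p;
}

static void
pagetable_array_attach(PagetableEntryRef *p)
{
    p->refcount++;          /* e.g. when the prefetch iterator is set up */
}

static void
pagetable_array_release(PagetableEntryRef **pp)
{
    /* whoever drops the count to zero frees; dsa_free() in real code */
    if (--(*pp)->refcount == 0)
        free(*pp);
    *pp = NULL;
}
```

With this shape, the main iterator and the prefetch iterator can both call one symmetric release function, so nodeBitmapHeapScan.c no longer needs to know which pieces of storage are shared.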
Re: [HACKERS] Parallel bitmap heap scan
On Sun, Feb 26, 2017 at 9:26 PM, Robert Haas wrote:

Thanks for the review; I will work on these. There are some comments I
need suggestions on.

> tbm_free_shared_area because the two iterators share one copy of the
> underlying iteration arrays, and the TBM code isn't smart enough to
> avoid freeing them twice. You're going to have to come up with a
> better solution to that problem; nodeBitmapHeapScan.c shouldn't know
> about the way the underlying storage details are managed. (Maybe you
> need to reference-count the iterator arrays?)

Yeah, I also think the current way doesn't look so clean; currently,
these arrays are just integer arrays. Maybe we can use the first slot
of the array for the reference count? Or convert to a structure which
has space for a reference count and an integer array. What do you
suggest?

> + if (node->inited)
> + goto start_iterate;
>
> My first programming teacher told me not to use goto. I've
> occasionally violated that rule, but I need a better reason than you
> have here. It looks very easy to avoid.

Yes, this can be avoided; I was just trying to get rid of multi-level
if nesting and ended up with the "goto".

> + pbms_set_parallel(outerPlanState(node));
>
> I think this should be a flag in the plan, and the planner should set
> it correctly, instead of having it be a flag in the executor that the
> executor sets. Also, the flag itself should really be called
> something that involves the word "shared" rather than "parallel",
> because the bitmap will not be created in parallel, but it will be
> shared.

Earlier, I thought that it would not be a good idea to set that flag in
the BitmapIndexScan path because the same path can be used either under
ParallelBitmapHeapPath or under a normal BitmapHeapPath. But, now after
putting some thought into it, I realised that we can do that in
create_plan. Therefore, I will change this.

> Have you checked whether this patch causes any regression in the
> non-parallel case? It adds a bunch of "if" statements, so it might.
> Hopefully not, but it needs to be carefully tested.

During the initial patch development, I tested this aspect also but
never published any results. I will do that along with my next patch
and post the results.

> pbms_is_leader() looks horrifically inefficient. Every worker will
> reacquire the spinlock for every tuple. You should only have to enter
> this spinlock-acquiring loop for the first tuple. After that, either
> you became the leader, did the initialization, and set the state to
> PBM_FINISHED, or you waited until the pre-existing leader did the
> same. You should have a backend-local flag that keeps you from
> touching the spinlock for every tuple. I wouldn't be surprised if
> fixing this nets a noticeable performance increase for this patch at
> high worker counts.

I think there is some confusion: if you notice, the below code will
avoid calling pbms_is_leader for every tuple. It will be called only
the first time. After that, node->inited will be true and it will
directly jump to start_iterate for subsequent calls. Am I missing
something?

> + if (node->inited)
> + goto start_iterate;

--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com
Re: [HACKERS] Parallel bitmap heap scan
On Tue, Feb 21, 2017 at 10:20 AM, Dilip Kumar wrote:
> 0001:
> 1. I have got a new interface, "tbm_free_shared_area(dsa_area *dsa,
> dsa_pointer dp)" which will be responsible for freeing all the shared
> members (pagetable, ptpage and ptchunk). Actually, we can not do this
> in tbm_free itself because only the leader will have a local tbm with
> reference to all these pointers, and our parallel bitmap leader may
> not necessarily be the actual backend.

I'm not entirely sure about the calling convention for
tbm_free_shared_area() but the rest seems OK.

> 2. Now tbm_free is not freeing any of the shared members which can be
> accessed by other workers, so tbm_free is safe to call from
> ExecEndBitmapHeapScan without any safety check or ref count.

That also seems fine. We ended up with something very similar in the
Parallel Index Scan patch.

> 0002:
> 1. We don't need ExecShutdownBitmapHeapScan anymore because now we are
> not freeing any shared member in ExecEndBitmapHeapScan.
> 2. In ExecReScanBitmapHeapScan we will call tbm_free_shared_area to
> free the shared members of the TBM.
> 3. After that, we will free the TBMSharedIteratorState we allocated
> using tbm_prepare_shared_iterate.

Check. But I think tbm_free_shared_area() should also free the object
itself, instead of making the caller do that separately.

+ if (DsaPointerIsValid(node->pstate->tbmiterator))
+ {
+ /* First we free the shared TBM members using the shared state */
+ tbm_free_shared_area(dsa, node->pstate->tbmiterator);
+ dsa_free(dsa, node->pstate->tbmiterator);
+ }
+
+ if (DsaPointerIsValid(node->pstate->prefetch_iterator))
+ dsa_free(dsa, node->pstate->prefetch_iterator);

The fact that these cases aren't symmetric suggests that your
abstraction is leaky. I'm guessing that you can't call
tbm_free_shared_area because the two iterators share one copy of the
underlying iteration arrays, and the TBM code isn't smart enough to
avoid freeing them twice. You're going to have to come up with a
better solution to that problem; nodeBitmapHeapScan.c shouldn't know
about the way the underlying storage details are managed. (Maybe you
need to reference-count the iterator arrays?)

+ if (node->inited)
+ goto start_iterate;

My first programming teacher told me not to use goto. I've
occasionally violated that rule, but I need a better reason than you
have here. It looks very easy to avoid.

+ pbms_set_parallel(outerPlanState(node));

I think this should be a flag in the plan, and the planner should set
it correctly, instead of having it be a flag in the executor that the
executor sets. Also, the flag itself should really be called
something that involves the word "shared" rather than "parallel",
because the bitmap will not be created in parallel, but it will be
shared.

Have you checked whether this patch causes any regression in the
non-parallel case? It adds a bunch of "if" statements, so it might.
Hopefully not, but it needs to be carefully tested.

@@ -48,10 +48,11 @@
 #include "utils/snapmgr.h"
 #include "utils/tqual.h"

-
 static TupleTableSlot *BitmapHeapNext(BitmapHeapScanState *node);

Unnecessary.

+ static bool pbms_is_leader(ParallelBitmapState *pbminfo);
+ static void pbms_set_parallel(PlanState *node);

I don't think this "pbms" terminology is very good. It's dissimilar
to the way other functions in this file are named. Maybe
BitmapShouldInitializeSharedState().

I think that some of the bits that this function makes conditional on
pstate should be factored out into inline functions. Like:

- if (node->prefetch_target >= node->prefetch_maximum)
-  /* don't increase any further */ ;
- else if (node->prefetch_target >= node->prefetch_maximum / 2)
- node->prefetch_target = node->prefetch_maximum;
- else if (node->prefetch_target > 0)
- node->prefetch_target *= 2;
- else
- node->prefetch_target++;
+ if (!pstate)
+ {
+ if (node->prefetch_target >= node->prefetch_maximum)
+  /* don't increase any further */ ;
+ else if (node->prefetch_target >= node->prefetch_maximum / 2)
+ node->prefetch_target = node->prefetch_maximum;
+ else if (node->prefetch_target > 0)
+ node->prefetch_target *= 2;
+ else
+ node->prefetch_target++;
+ }
+ else if (pstate->prefetch_target < node->prefetch_maximum)
+ {
+ SpinLockAcquire(&pstate->prefetch_mutex);
+ if (pstate->prefetch_target >= node->prefetch_maximum)
+  /* don't increase any further */ ;
+ else if (pstate->prefetch_target >=
+  node->prefetch_maximum / 2)
+ pstate->prefetch_target = node->prefetch_maximum;
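[Editor's note: the growth logic quoted in the hunk above can be factored into one small helper, as the review suggests. This is an illustrative sketch with invented names — the real code would operate on the executor node or, in the shared case, on the parallel state under its mutex.]

```c
/*
 * Sketch: the prefetch-target ramp-up as a single pure helper.
 * Starting from 0, the target grows 1, 2, 4, ... (doubling), jumps to
 * the cap once it reaches half of it, and never exceeds the cap.
 */
#include <assert.h>

static inline int
grow_prefetch_target(int target, int maximum)
{
    if (target >= maximum)
        return target;          /* don't increase any further */
    if (target >= maximum / 2)
        return maximum;         /* jump straight to the cap */
    if (target > 0)
        return target * 2;      /* exponential ramp-up */
    return target + 1;          /* 0 -> 1 to get started */
}
```

With maximum = 64 the sequence of targets is 0, 1, 2, 4, 8, 16, 32, 64, and then it stays put; the serial and parallel code paths could both call this, with the parallel caller wrapping it in the spinlock.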
Re: [HACKERS] Parallel bitmap heap scan
On Mon, Feb 20, 2017 at 11:18 PM, Robert Haas wrote:
> On Mon, Feb 20, 2017 at 5:55 AM, Thomas Munro wrote:
>> On Sun, Feb 19, 2017 at 9:59 PM, Dilip Kumar wrote:
>>> So basically, what I want to propose is that only during
>>> ExecReScanBitmapHeapScan we can free all the DSA pointers because at
>>> that time we can be sure that all the workers have completed their
>>> task and we are safe to free. (And we don't free any DSA memory at
>>> ExecEndBitmapHeapScan.)
>>
>> I think this works.
>
> OK.

In my latest version of the patch, I have fixed it as described above,
i.e. only cleanup in ExecReScanBitmapHeapScan. For getting this there
is some change in both the patches.

0001:
1. I have got a new interface, "tbm_free_shared_area(dsa_area *dsa,
dsa_pointer dp)" which will be responsible for freeing all the shared
members (pagetable, ptpage and ptchunk). Actually, we can not do this
in tbm_free itself because only the leader will have a local tbm with
reference to all these pointers, and our parallel bitmap leader may
not necessarily be the actual backend. If we want to get this done
using tbm, then we need to create a local tbm in each worker and get
the shared member references copied into the tbm using
tbm_attach_shared_iterate like we were doing in the earlier version.
2. Now tbm_free is not freeing any of the shared members which can be
accessed by other workers, so tbm_free is safe to call from
ExecEndBitmapHeapScan without any safety check or ref count.

0002:
1. We don't need ExecShutdownBitmapHeapScan anymore because now we are
not freeing any shared member in ExecEndBitmapHeapScan.
2. In ExecReScanBitmapHeapScan we will call tbm_free_shared_area to
free the shared members of the TBM.
3. After that, we will free the TBMSharedIteratorState we allocated
using tbm_prepare_shared_iterate.
--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com

0001-tidbitmap-support-shared-v5.patch
Description: Binary data

0002-parallel-bitmap-heapscan-v5.patch
Description: Binary data
Re: [HACKERS] Parallel bitmap heap scan
Robert Haas writes:
> On Mon, Feb 20, 2017 at 6:19 AM, Tom Lane wrote:
>> The thing that you really have to worry about for this kind of proposal
>> is "what if the query errors out and we never get to ExecEndNode"?
>> It's particularly nasty if you're talking about parallel queries where
>> maybe only one or some of the processes involved detect an error.

> I think that's not actually a problem, because we've already got code
> to make sure that all DSM resources associated with the query get
> blown away in that case. Of course, that code might have bugs, but if
> it does, I think it's better to try to fix those bugs than to insert
> some belt-and-suspenders mechanism for reclaiming every possible chunk
> of memory in retail fashion, just like we blow up es_query_cxt rather
> than trying to pfree allocations individually.

Actually, I think we're saying the same thing: rely on the general DSM
cleanup mechanism, don't insert extra stuff that you expect will get
done by executor shutdown.

			regards, tom lane
Re: [HACKERS] Parallel bitmap heap scan
On Mon, Feb 20, 2017 at 6:19 AM, Tom Lane wrote:
> Thomas Munro writes:
>> One practical problem that came up was the need for executor nodes to
>> get a chance to do that kind of cleanup before the DSM segment is
>> detached. In my patch series I introduced a new node API
>> ExecNodeDetach to allow for that. Andres objected that the need for
>> that is evidence that the existing protocol is broken and should be
>> fixed instead. I'm looking into that.
>
> The thing that you really have to worry about for this kind of proposal
> is "what if the query errors out and we never get to ExecEndNode"?
> It's particularly nasty if you're talking about parallel queries where
> maybe only one or some of the processes involved detect an error.

I think that's not actually a problem, because we've already got code
to make sure that all DSM resources associated with the query get
blown away in that case. Of course, that code might have bugs, but if
it does, I think it's better to try to fix those bugs than to insert
some belt-and-suspenders mechanism for reclaiming every possible chunk
of memory in retail fashion, just like we blow up es_query_cxt rather
than trying to pfree allocations individually.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Re: [HACKERS] Parallel bitmap heap scan
On Mon, Feb 20, 2017 at 5:55 AM, Thomas Munro wrote:
> On Sun, Feb 19, 2017 at 9:59 PM, Dilip Kumar wrote:
>> So basically, what I want to propose is that only during
>> ExecReScanBitmapHeapScan we can free all the DSA pointers because at
>> that time we can be sure that all the workers have completed their
>> task and we are safe to free. (And we don't free any DSA memory at
>> ExecEndBitmapHeapScan.)
>
> I think this works.

OK.

> Some hand-wavy thoughts on this topic in the context of hash joins:
>
> The argument for cleaning up sooner rather than later would be that it
> could reduce the total peak memory usage of large execution plans. Is
> that a reasonable goal and can we achieve it? I suspect the answer is
> yes in theory but no in practice, and we don't even try to achieve it
> in non-parallel queries as far as I know.

We're pretty stupid about causing nodes to stop eating up resources as
early as we could; for example, when a Limit is filled, we don't make
any attempt to have scans underneath it release pins or memory or
anything else. But we don't usually let the same node consume memory
multiple times. ExecReScanBitmapHeapScan frees all of the memory used
for the previous bitmap in the non-parallel case, so it should
probably do that in the parallel case also.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Re: [HACKERS] Parallel bitmap heap scan
On Sun, Feb 19, 2017 at 10:34 PM, Robert Haas wrote:
> Yeah, but it looks like ExecReScanGather gets rid of the workers, but
> reuses the existing DSM. I'm not quite sure what happens to the DSA.
> It looks like it probably just hangs around from the previous
> iteration, which means that any allocations will also hang around.

Yes, right, they hang around. But, during a rescan
(ExecReScanBitmapHeapScan), can't we free all these DSA pointers? That
means before reallocating the DSA pointers we would have already got
rid of the old ones.

--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com
Re: [HACKERS] Parallel bitmap heap scan
Thomas Munro writes:
> One practical problem that came up was the need for executor nodes to
> get a chance to do that kind of cleanup before the DSM segment is
> detached. In my patch series I introduced a new node API
> ExecNodeDetach to allow for that. Andres objected that the need for
> that is evidence that the existing protocol is broken and should be
> fixed instead. I'm looking into that.

The thing that you really have to worry about for this kind of proposal
is "what if the query errors out and we never get to ExecEndNode"?
It's particularly nasty if you're talking about parallel queries where
maybe only one or some of the processes involved detect an error.

			regards, tom lane
Re: [HACKERS] Parallel bitmap heap scan
On Sun, Feb 19, 2017 at 10:34 PM, Robert Haas wrote:
> On Sun, Feb 19, 2017 at 9:59 PM, Dilip Kumar wrote:
>> I can imagine it can get executed over and over if the plan is
>> something like below.
>>
>> NestLoopJoin
>> -> SeqScan
>> -> Gather
>> -> Parallel Bitmap Heap Scan
>>
>> But in such a case every time the inner node of the NLJ will be
>> rescanned, i.e. Gather will be rescanned, which in turn shuts down
>> the workers.
>
> Yeah, but it looks like ExecReScanGather gets rid of the workers, but
> reuses the existing DSM. I'm not quite sure what happens to the DSA.
> It looks like it probably just hangs around from the previous
> iteration, which means that any allocations will also hang around.

Yes, it hangs around. Being able to reuse state in a rescan is a
feature: you might be able to reuse a hash table or a bitmap. In the
Parallel Shared Hash patch, the last participant to detach from the
shared hash table barrier gets a different return code and runs some
cleanup code. The alternative was to make the leader wait for the
workers to finish accessing the hash table so that it could do that.
I had it that way in an early version, but my goal is to minimise
synchronisation points, so now it's just 'last to leave turns out the
lights' with no waiting.

One practical problem that came up was the need for executor nodes to
get a chance to do that kind of cleanup before the DSM segment is
detached. In my patch series I introduced a new node API
ExecNodeDetach to allow for that. Andres objected that the need for
that is evidence that the existing protocol is broken and should be
fixed instead. I'm looking into that.

On Sun, Feb 19, 2017 at 9:59 PM, Dilip Kumar wrote:
> So basically, what I want to propose is that only during
> ExecReScanBitmapHeapScan we can free all the DSA pointers because at
> that time we can be sure that all the workers have completed their
> task and we are safe to free. (And we don't free any DSA memory at
> ExecEndBitmapHeapScan.)

I think this works.
I also grappled a bit with the question of whether it's actually worth
trying to free DSA memory when you're finished with it, eating
precious CPU cycles at the end of a join, or just letting the
executor's DSA area get nuked at end of parallel execution. As you
say, there is a special case for rescans to avoid leaks. I described
this as a potential approach in a TODO note in my v5 patch, but
currently my code just does the clean-up every time on the grounds
that it's simple and hasn't shown up as a performance problem yet.

Some hand-wavy thoughts on this topic in the context of hash joins:

The argument for cleaning up sooner rather than later would be that it
could reduce the total peak memory usage of large execution plans. Is
that a reasonable goal and can we achieve it? I suspect the answer is
yes in theory but no in practice, and we don't even try to achieve it
in non-parallel queries as far as I know. My understanding is that
PostgreSQL's planner can generate left-deep, bushy and right-deep hash
join plans:

N nested left-deep hash joins: Each hash join is on the outer side of
its parent, supplying tuples to probe the parent hash table. Their
probe phases overlap, so all N hash tables must exist and be fully
loaded at the same time.

N nested right-deep hash joins: Each hash join is on the inner side of
its parent, supplying tuples to build the hash table of its parent.
Theoretically you only need two full hash tables in memory at peak,
because you'll finish probing each one while building its parent's
hash table and then not need the child's hash table again (unless you
need to rescan).

N nested bushy hash joins: Somewhere in between.

But we don't actually take advantage of that opportunity to reduce
peak memory today. We always have N live hash tables and don't free
them until standard_ExecutorEnd runs ExecProcEnd on the top of the
plan.
Perhaps we could teach hash tables to free themselves ASAP at the end
of their probe phase unless they know a rescan is possible. But that
just opens a whole can of worms: if we care about total peak memory
usage, should it become a planner goal that might favour right-deep
hash joins? I guess similar questions must arise for bitmap heap scan
and anything else holding memory that it doesn't technically need
anymore, and parallel query doesn't really change anything about the
situation, except maybe that Gather nodes provide a point of scoping
somewhere in between 'eager destruction' and 'hog all the space until
end of plan' which makes things a bit better. I don't know anywhere
near enough about query planners to say whether such ideas about
planning are reasonable, and am quite aware that it's difficult
terrain, and I have other fish to fry, so I'm going to put down the
can opener and back away.

--
Thomas Munro
http://www.enterprisedb.com
Re: [HACKERS] Parallel bitmap heap scan
On Sun, Feb 19, 2017 at 9:59 PM, Dilip Kumar wrote: > I can imagine it can get executed over and over if plan is something like > below. > > NestLoopJoin > -> SeqScan > -> Gather > -> Parallel Bitmap Heap Scan > > But in such case every time the Inner node of the NLJ will be > rescanned i.e. Gather will be rescanned which in turn shutdown > workers. Yeah, but it looks like ExecReScanGather gets rid of the workers, but reuses the existing DSM. I'm not quite sure what happens to the DSA. It looks like it probably just hangs around from the previous iteration, which means that any allocations will also hang around. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Parallel bitmap heap scan
On Sun, Feb 19, 2017 at 7:44 PM, Robert Haas wrote:
> It's probably OK if tbm_free() doesn't free the memory allocated from
> DSA, and we just let cleanup at end of query do it. However, that
> could cause some trouble if the Parallel Bitmap Heap Scan gets
> executed over and over and keeps allocating more and more memory from
> DSA.

Is it possible that Parallel Bitmap Heap Scan will be executed multiple times without shutting down the workers? I can imagine it getting executed over and over if the plan is something like below.

NestLoopJoin
  -> SeqScan
  -> Gather
      -> Parallel Bitmap Heap Scan

But in such a case, every time the inner node of the NLJ is rescanned, i.e. Gather is rescanned, which in turn shuts down the workers.

So basically, what I want to propose is that only during ExecReScanBitmapHeapScan can we free all the DSA pointers, because at that time we can be sure that all the workers have completed their tasks and it is safe to free. (And we don't free any DSA memory at ExecEndBitmapHeapScan.)

> I think the way to fix that would be to maintain a reference
> count that starts at 1 when the iterator arrays are created and gets
> incremented every time a TBMSharedIteratorState is created. It gets
> decremented when the TIDBitmap that has the iterator arrays is
> destroyed, and each time a TBMSharedIteratorState is destroyed.
> When it reaches 0, the process that reduces the reference count to 0
> calls dsa_free on the DSA pointers for pagetable, spages, and schunks.
> (Also, if a TIDBitmap is freed before iteration begins, it frees the
> DSA pointer for the pagetable only; the others won't have values yet.)

-- Regards, Dilip Kumar EnterpriseDB: http://www.enterprisedb.com -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Parallel bitmap heap scan
On Sun, Feb 19, 2017 at 7:18 PM, Dilip Kumar wrote:
> I have observed one problem with 0002, and I thought of sharing it
> before fixing it, because we might have encountered the same problem
> in other patches (e.g. parallel shared hash) and there might already
> be a way to handle it.
>
> The problem is that in ExecEndBitmapHeapScan we do tbm_free, which
> frees the local pagetable memory and the shared memory using the
> callback routine. If any worker other than the backend (the actual
> backend for the client executing the Gather node) becomes the leader
> (the worker responsible for creating the shared TBM), then it will
> finish its work and free the shared memory by calling
> ExecEndBitmapHeapScan, while it's possible that other workers are
> still operating on the shared memory.
>
> So far I have never observed this issue in our tests, maybe because
> either the backend becomes the leader, or by the time the
> (non-backend) leader frees the TBM the others have also finished
> their work.
>
> Solution:
> - One solution can be that the leader should not complete the scan
> until all other workers have finished their work.
> - We could also avoid making anyone wait by arranging for the last
> worker to finish the bitmap scan to be the one that frees the memory;
> one problem with this solution is that the last worker can also be a
> non-leader, which has access to the shared memory, but then how do we
> free the leader's local pagetable memory?
>
> Any suggestion on this?

It's probably OK if tbm_free() doesn't free the memory allocated from DSA, and we just let cleanup at end of query do it. However, that could cause some trouble if the Parallel Bitmap Heap Scan gets executed over and over and keeps allocating more and more memory from DSA. I think the way to fix that would be to maintain a reference count that starts at 1 when the iterator arrays are created and gets incremented every time a TBMSharedIteratorState is created. 
It gets decremented when the TIDBitmap that has the iterator arrays is destroyed, and each time a TBMSharedIteratorState is destroyed. When it reaches 0, the process that reduces the reference count to 0 calls dsa_free on the DSA pointers for pagetable, spages, and schunks. (Also, if a TIDBitmap is freed before iteration begins, it frees the DSA pointer for the pagetable only; the others won't have values yet.) -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Parallel bitmap heap scan
On Sun, Feb 19, 2017 at 7:18 PM, Dilip Kumar wrote:
> On Sat, Feb 18, 2017 at 10:45 PM, Dilip Kumar wrote:
>> in 0002:
>> - Improved comments.
>> - Code refactoring in BitmapHeapNext.
>> - Removed local tbm creation in BitmapHeapNext : as per new tidbitmap
>> it's of no use.
>
> I have observed one problem with 0002, and I thought of sharing it
> before fixing it, because we might have encountered the same problem
> in other patches (e.g. parallel shared hash) and there might already
> be a way to handle it.
>
> The problem is that in ExecEndBitmapHeapScan we do tbm_free, which
> frees the local pagetable memory and the shared memory using the
> callback routine. If any worker other than the backend (the actual
> backend for the client executing the Gather node) becomes the leader
> (the worker responsible for creating the shared TBM), then it will
> finish its work and free the shared memory by calling
> ExecEndBitmapHeapScan, while it's possible that other workers are
> still operating on the shared memory.
>
> So far I have never observed this issue in our tests, maybe because
> either the backend becomes the leader, or by the time the
> (non-backend) leader frees the TBM the others have also finished
> their work.
>
> Solution:
> - One solution can be that the leader should not complete the scan
> until all other workers have finished their work.

For Parallel Seq Scan, we do wait for parallel workers to finish before freeing the shared memory. Why is the case different here?

-- With Regards, Amit Kapila. EnterpriseDB: http://www.enterprisedb.com -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Parallel bitmap heap scan
On Sat, Feb 18, 2017 at 10:45 PM, Dilip Kumar wrote:
> in 0002:
> - Improved comments.
> - Code refactoring in BitmapHeapNext.
> - Removed local tbm creation in BitmapHeapNext : as per new tidbitmap
> it's of no use.

I have observed one problem with 0002, and I thought of sharing it before fixing it, because we might have encountered the same problem in other patches (e.g. parallel shared hash) and there might already be a way to handle it.

The problem is that in ExecEndBitmapHeapScan we do tbm_free, which frees the local pagetable memory and the shared memory using the callback routine. If any worker other than the backend (the actual backend for the client executing the Gather node) becomes the leader (the worker responsible for creating the shared TBM), then it will finish its work and free the shared memory by calling ExecEndBitmapHeapScan, while it's possible that other workers are still operating on the shared memory.

So far I have never observed this issue in our tests, maybe because either the backend becomes the leader, or by the time the (non-backend) leader frees the TBM the others have also finished their work.

Solution:
- One solution can be that the leader should not complete the scan until all other workers have finished their work.
- We could also avoid making anyone wait by arranging for the last worker to finish the bitmap scan to be the one that frees the memory; one problem with this solution is that the last worker can also be a non-leader, which has access to the shared memory, but then how do we free the leader's local pagetable memory?

Any suggestion on this?

-- Regards, Dilip Kumar EnterpriseDB: http://www.enterprisedb.com -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Parallel bitmap heap scan
On Fri, Feb 17, 2017 at 2:01 AM, Robert Haas wrote:
> + * in order to share relptrs of the chunk and the pages arrays and other
> + * TBM related information with other workers.
>
> No relptrs any more.

Fixed

> +    dsa_pointer dsapagetable;       /* dsa_pointer to the element array */
> +    dsa_pointer dsapagetableold;    /* dsa_pointer to the old element array */
> +    dsa_area   *dsa;                /* reference to per-query dsa area */
> +    dsa_pointer ptpages;            /* dsa_pointer to the page array */
> +    dsa_pointer ptchunks;           /* dsa_pointer to the chunk array */
>
> Let's put the DSA pointer first and then the other stuff after it.
> That seems more logical.

Done that way

> +typedef struct TBMSharedIteratorState
> +{
> +    int         spageptr;       /* next spages index */
> +    int         schunkptr;      /* next schunks index */
> +    int         schunkbit;      /* next bit to check in current schunk */
> +    LWLock      lock;           /* lock to protect above members */
> +    dsa_pointer pagetable;      /* dsa pointers to head of pagetable data */
> +    dsa_pointer spages;         /* dsa pointer to page array */
> +    dsa_pointer schunks;        /* dsa pointer to chunk array */
> +    int         nentries;       /* number of entries in pagetable */
> +    int         maxentries;     /* limit on same to meet maxbytes */
> +    int         npages;         /* number of exact entries in pagetable */
> +    int         nchunks;        /* number of lossy entries in pagetable */
> +} TBMSharedIteratorState;
>
> I think you've got this largely backwards from the most logical order.
> Let's order it like this: nentries, maxentries, npages, nchunks,
> pagetable, spages, schunks, lock (to protect BELOW members), spageptr,
> chunkptr, schunkbit.
Done > > +struct TBMSharedIterator > +{ > +PagetableEntry *ptbase;/* pointer to the pagetable element array > */ > +dsa_area *dsa;/* reference to per-query dsa area */ > +int *ptpages;/* sorted exact page index list */ > +int *ptchunks;/* sorted lossy page index list */ > +TBMSharedIteratorState *state;/* shared members */ > +TBMIterateResult output;/* MUST BE LAST (because variable-size) */ > +}; > > Do we really need "dsa" here when it's already present in the shared > state? It doesn't seem like we even use it for anything. It's needed > to create each backend's TBMSharedIterator, but not afterwards, I > think. Right, removed. > > In terms of ordering, I'd move "state" to the top of the structure, > just like "tbm" comes first in a TBMIterator. Yeah, that looks better. Done that way. > > + * total memory consumption. If dsa passed to this function is not NULL > + * then the memory for storing elements of the underlying page table will > + * be allocated from the DSA. > > Notice that you capitalized "DSA" in two different ways in the same > sentence; I'd go for the all-caps version. Also, you need the word > "the" before the first one. Fixed, all such instances. > > +if (tbm->status == TBM_HASH && (tbm->iterating == TBM_NOT_ITERATING)) > > Excess parentheses. Fixed > > + * tbm_prepare_shared_iterate - prepare to iterator through a TIDBitmap > + * by multiple workers using shared iterator. > > Prepare to iterate, not prepare to iterator. I suggest rephrasing > this as "prepare shared iteration state for a TIDBitmap". Fixed. > > + * The TBMSharedIteratorState will be allocated from DSA so that multiple > + * worker can attach to it and iterate jointly. > > Maybe change to "The necessary shared state will be allocated from the > DSA passed to tbm_create, so that multiple workers can attach to it > and iterate jointly". Done. 
> > + * This will convert the pagetable hash into page and chunk array of the > index > + * into pagetable array so that multiple workers can use it to get the actual > + * page entry. > > I think you can leave off everything starting from "so that". It's > basically redundant with what you already said. Done > > +dsa_pointer iterator; > +TBMSharedIteratorState *iterator_state; > > These aren't really great variable names, because the latter isn't the > state associated with the former. They're both the same object. > Maybe just "dp" and "istate". Done > > I think this function should Assert(tbm->dsa != NULL) and > Assert(tbm->iterating != TBM_ITERATING_PRIVATE), and similarly > tbm_begin_iterate should Assert(tbm->iterating != TBM_ITERATE_SHARED). Done > > +/* > + * If we have a hashtable, create and fill the sorted page lists, unless > + * we already did that for a previous iterator. Note that the lists are > + * attached to the bitmap not the iterator, so they can be used by more > + * than one iterator. However, we keep dsa_pointer to these in the > shared > + * iterator so that other workers can access them directly. >
Re: [HACKERS] Parallel bitmap heap scan
On Thu, Feb 16, 2017 at 10:54 AM, Dilip Kumar wrote:
> interface_dsa_allocate0.patch : Provides new interface dsa_allocate0,
> which is used by 0001

Committed. I moved the function prototype for dsa_allocate0() next to the existing prototype for dsa_allocate(). Please try to think about things like this when you add functions or prototypes or structure members: put them in a natural place where they are close to related things, not just wherever happens to be convenient.

> gather_shutdown_childeren_first.patch : Processing the child node

This same problem came up on the Parallel Hash thread. I mentioned this proposed fix over there; let's see what he (or anyone else) has to say about it.

> 0001: tidbimap change to provide the interfaces for shared iterator.

+ * in order to share relptrs of the chunk and the pages arrays and other
+ * TBM related information with other workers.

No relptrs any more.

+    dsa_pointer dsapagetable;       /* dsa_pointer to the element array */
+    dsa_pointer dsapagetableold;    /* dsa_pointer to the old element array */
+    dsa_area   *dsa;                /* reference to per-query dsa area */
+    dsa_pointer ptpages;            /* dsa_pointer to the page array */
+    dsa_pointer ptchunks;           /* dsa_pointer to the chunk array */

Let's put the DSA pointer first and then the other stuff after it. That seems more logical. 
+typedef struct TBMSharedIteratorState
+{
+    int         spageptr;       /* next spages index */
+    int         schunkptr;      /* next schunks index */
+    int         schunkbit;      /* next bit to check in current schunk */
+    LWLock      lock;           /* lock to protect above members */
+    dsa_pointer pagetable;      /* dsa pointers to head of pagetable data */
+    dsa_pointer spages;         /* dsa pointer to page array */
+    dsa_pointer schunks;        /* dsa pointer to chunk array */
+    int         nentries;       /* number of entries in pagetable */
+    int         maxentries;     /* limit on same to meet maxbytes */
+    int         npages;         /* number of exact entries in pagetable */
+    int         nchunks;        /* number of lossy entries in pagetable */
+} TBMSharedIteratorState;

I think you've got this largely backwards from the most logical order. Let's order it like this: nentries, maxentries, npages, nchunks, pagetable, spages, schunks, lock (to protect BELOW members), spageptr, chunkptr, schunkbit.

+struct TBMSharedIterator
+{
+    PagetableEntry *ptbase;     /* pointer to the pagetable element array */
+    dsa_area   *dsa;            /* reference to per-query dsa area */
+    int        *ptpages;        /* sorted exact page index list */
+    int        *ptchunks;       /* sorted lossy page index list */
+    TBMSharedIteratorState *state;  /* shared members */
+    TBMIterateResult output;    /* MUST BE LAST (because variable-size) */
+};

Do we really need "dsa" here when it's already present in the shared state? It doesn't seem like we even use it for anything. It's needed to create each backend's TBMSharedIterator, but not afterwards, I think.

In terms of ordering, I'd move "state" to the top of the structure, just like "tbm" comes first in a TBMIterator.

+ * total memory consumption. If dsa passed to this function is not NULL
+ * then the memory for storing elements of the underlying page table will
+ * be allocated from the DSA.

Notice that you capitalized "DSA" in two different ways in the same sentence; I'd go for the all-caps version. Also, you need the word "the" before the first one.

+    if (tbm->status == TBM_HASH && (tbm->iterating == TBM_NOT_ITERATING))

Excess parentheses. 
+ * tbm_prepare_shared_iterate - prepare to iterator through a TIDBitmap
+ * by multiple workers using shared iterator.

Prepare to iterate, not prepare to iterator. I suggest rephrasing this as "prepare shared iteration state for a TIDBitmap".

+ * The TBMSharedIteratorState will be allocated from DSA so that multiple
+ * worker can attach to it and iterate jointly.

Maybe change to "The necessary shared state will be allocated from the DSA passed to tbm_create, so that multiple workers can attach to it and iterate jointly".

+ * This will convert the pagetable hash into page and chunk array of the index
+ * into pagetable array so that multiple workers can use it to get the actual
+ * page entry.

I think you can leave off everything starting from "so that". It's basically redundant with what you already said.

+    dsa_pointer iterator;
+    TBMSharedIteratorState *iterator_state;

These aren't really great variable names, because the latter isn't the state associated with the former. They're both the same object. Maybe just "dp" and "istate".

I think this function should Assert(tbm->dsa != NULL) and Assert(tbm->iterating != TBM_ITERATING_PRIVATE), and similarly tbm_begin_iterate should Assert(tbm->iterating != TBM_ITERATE_SHARED).

+/*
+ * If
Re: [HACKERS] Parallel bitmap heap scan
On Tue, Feb 14, 2017 at 10:18 PM, Robert Haas wrote: > What is the point of having a TBMSharedIterator contain a TIDBitmap > pointer? All the values in that TIDBitmap are populated from the > TBMSharedInfo, but it seems to me that the fields that are copied over > unchanged (like status and nchunks) could just be used directly from > the TBMSharedInfo, and the fields that are computed (like relpages and > relchunks) could be stored directly in the TBMSharedIterator. > tbm_shared_iterate() is already separate code from tbm_iterate(), so > it can be changed to refer to whichever data structure contains the > data it needs. Done > > Why is TBMSharedInfo separate from TBMSharedIteratorState? It seems > like you could merge those two into a single structure. Done > > I think we can simplify things here a bit by having shared iterators > not support single-page mode. In the backend-private case, > tbm_begin_iterate() really doesn't need to do anything with the > pagetable in the TBM_ONE_PAGE case, but for a shared iterator, we've > got to anyway copy the pagetable into shared memory. So it seems to > me that it would be simpler just to transform it into a standard > iteration array while we're at it, instead of copying it into entry1. > In other words, I suggest removing both "entry1" and "status" from > TBMSharedInfo and making tbm_prepare_shared_iterate smarter to > compensate. Done > > I think "dsa_data" is poorly named; it should be something like > "dsapagetable" in TIDBitmap and just "pagetable" in TBMSharedInfo. I > think you should should move the "base", "relpages", and "relchunks" > into TBMSharedIterator and give them better names, like "ptbase", > "ptpages" and "ptchunks". "base" isn't clear that we're talking about > the pagetable's base as opposed to anything else, and "relpages" and > "relchunks" are named based on the fact that the pointers are relative > rather than named based on what data they point at, which doesn't seem > like the right choice. 
Done > > I suggest putting the parallel functions next to each other in the > file: tbm_begin_iterate(), tbm_prepare_shared_iterate(), > tbm_iterate(), tbm_shared_iterate(), tbm_end_iterate(), > tbm_end_shared_iterate(). Done > > +if (tbm->dsa == NULL) > +return pfree(pointer); > > Don't try to return a value of type void. The correct spelling of > this is { pfree(pointer); return; }. Formatted appropriately across 4 > lines, of course. Fixed > > +/* > + * If TBM is in iterating phase that means pagetable is already created > + * and we have come here during tbm_free. By this time we are already > + * detached from the DSA because the GatherNode would have been shutdown. > + */ > +if (tbm->iterating) > +return; > > This seems like something of a kludge, although not a real easy one to > fix. The problem here is that tidbitmap.c ideally shouldn't have to > know that it's being used by the executor or that there's a Gather > node involved, and certainly not the order of operations around > shutdown. It should really be the problem of 0002 to handle this kind > of problem, without 0001 having to know anything about it. It strikes > me that it would probably be a good idea to have Gather clean up its > children before it gets cleaned up itself. 
So in ExecShutdownNode, we > could do something like this: > > diff --git a/src/backend/executor/execProcnode.c > b/src/backend/executor/execProcnode.c > index 0dd95c6..5ccc2e8 100644 > --- a/src/backend/executor/execProcnode.c > +++ b/src/backend/executor/execProcnode.c > @@ -815,6 +815,8 @@ ExecShutdownNode(PlanState *node) > if (node == NULL) > return false; > > +planstate_tree_walker(node, ExecShutdownNode, NULL); > + > switch (nodeTag(node)) > { > case T_GatherState: > @@ -824,5 +826,5 @@ ExecShutdownNode(PlanState *node) > break; > } > > -return planstate_tree_walker(node, ExecShutdownNode, NULL); > +return false; > } > > Also, in ExecEndGather, something like this: > > diff --git a/src/backend/executor/nodeGather.c > b/src/backend/executor/nodeGather.c > index a1a3561..32c97d3 100644 > --- a/src/backend/executor/nodeGather.c > +++ b/src/backend/executor/nodeGather.c > @@ -229,10 +229,10 @@ ExecGather(GatherState *node) > void > ExecEndGather(GatherState *node) > { > +ExecEndNode(outerPlanState(node)); /* let children clean up first */ > ExecShutdownGather(node); > ExecFreeExprContext(&node->ps); > ExecClearTuple(node->ps.ps_ResultTupleSlot); > -ExecEndNode(outerPlanState(node)); > } > > /* > > With those changes and an ExecShutdownBitmapHeapScan() called from > ExecShutdownNode(), it should be possible (I think) for us to always > have the bitmap heap scan node shut down before the Gather node shuts > down, which I think would let you avoid having a special case for this > inside the TBM code. Done (gather_shutdown_children_first.patch does what y
Re: [HACKERS] Parallel bitmap heap scan
On Wed, Feb 15, 2017 at 7:17 AM, Dilip Kumar wrote: > This is easily doable and it will look much cleaner. While doing this > I am facing one problem related to > relptr_store and relptr_access. The problem is that when I call > "relptr_store (base, rp, val)" with both base and val as a same > address then it will store offset as 0 which is fine, but when we use > relptr_access with 0 offset it is returning NULL instead of giving the > actual address. I expect it should return directly base when offset is > 0. (I am facing this problem because in this case (TBM_ONE_PAGE), > there will be only one page and address of that will be same as > base.). > > There can be multiple solutions to this but none of them looks cleaner to me. > sol1: If relptr_access return NULL then directly use the base address > as our PagetableEntry. > sol2: Instead of using base as start of the element array, just try to > use some modified base as e.g base=base - some number. > sol3: change relptr_access to not return NULL in this case. But, this > will change the current behaviour of this interface and need to > analyze the side effects. Hmm, yeah, that's a problem. How about not using relative pointers here, and instead just using array indexes? >> I don't see why you think you need to add sizeof(dsa_pointer) to the >> allocation and store an extra copy of the dsa_pointer in the >> additional space. You are already storing it in tbm->dsa_data and >> that seems good enough. pagetable_free() needs the value, but it has >> a pointer to the TIDBitmap and could just pass tbm->dsa_data directly >> instead of fishing the pointer out of the extra space. > > If you see the code of SH_GROW, first it needs to allocate the bigger > chunk copy data from smaller chunk to the bigger chunk and then free > the smaller chunk. Oh, I see. 
> So by the time it call the pagetable_free, it would have already > called the pagetable_allocate for the newer bigger chunk of memory so > now, tbm->dsa_data points to the latest memory, but pagetable_free > wants to free older one. > > One solution to this could be that I keep two dsa_pointer in TBM, one > point to old memory and another to new. (After this here we will get > the same problem of relptr_access because now we will have first entry > pointer is same as base pointer) Yes, two dsa_pointers seems fine. Maybe that's not totally beautiful, but it seems better than what you had before. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Parallel bitmap heap scan
On Tue, Feb 14, 2017 at 10:18 PM, Robert Haas wrote:
> Thanks, the external interface to this looks much cleaner now.
> Scrutinizing the internals:

Thanks for the comments, I am working on these. I have a few doubts about some of the comments.

> I suggest removing both "entry1" and "status" from
> TBMSharedInfo and making tbm_prepare_shared_iterate smarter to
> compensate.

This is easily doable, and it will look much cleaner. While doing this I am facing one problem related to relptr_store and relptr_access. The problem is that when I call "relptr_store(base, rp, val)" with base and val being the same address, it will store the offset as 0, which is fine; but when we use relptr_access with a 0 offset, it returns NULL instead of giving the actual address. I expect it to return the base directly when the offset is 0. (I am facing this problem because in this case (TBM_ONE_PAGE) there will be only one page, whose address will be the same as the base.)

One might ask why we did not face this issue while converting the pagetable to the page array, because at least one entry should have the same address as the base. But coincidentally we did not hit that problem, because as of now we were storing a dsa_pointer at the head of the element array, so the first pagetable element was not the same as the base.

There can be multiple solutions to this, but none of them looks clean to me.

sol1: If relptr_access returns NULL, directly use the base address as our PagetableEntry.
sol2: Instead of using base as the start of the element array, use some modified base, e.g. base = base - some number.
sol3: Change relptr_access to not return NULL in this case. But this will change the current behaviour of the interface, and we would need to analyze the side effects.

> I don't see why you think you need to add sizeof(dsa_pointer) to the
> allocation and store an extra copy of the dsa_pointer in the
> additional space. 
> You are already storing it in tbm->dsa_data and
> that seems good enough. pagetable_free() needs the value, but it has
> a pointer to the TIDBitmap and could just pass tbm->dsa_data directly
> instead of fishing the pointer out of the extra space.

If you see the code of SH_GROW: first it needs to allocate the bigger chunk, copy the data from the smaller chunk to the bigger one, and then free the smaller chunk. So by the time it calls pagetable_free, it has already called pagetable_allocate for the newer, bigger chunk of memory; tbm->dsa_data now points to the latest memory, but pagetable_free wants to free the older one.

One solution to this could be to keep two dsa_pointers in the TBM, one pointing to the old memory and another to the new. (After this we will again hit the relptr_access problem, because now the first entry pointer is the same as the base pointer.)

Please provide your thoughts.

-- Regards, Dilip Kumar EnterpriseDB: http://www.enterprisedb.com -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Parallel bitmap heap scan
On Tue, Feb 14, 2017 at 12:18 AM, Dilip Kumar wrote:
> Fixed.

Thanks, the external interface to this looks much cleaner now. Scrutinizing the internals:

What is the point of having a TBMSharedIterator contain a TIDBitmap pointer? All the values in that TIDBitmap are populated from the TBMSharedInfo, but it seems to me that the fields that are copied over unchanged (like status and nchunks) could just be used directly from the TBMSharedInfo, and the fields that are computed (like relpages and relchunks) could be stored directly in the TBMSharedIterator. tbm_shared_iterate() is already separate code from tbm_iterate(), so it can be changed to refer to whichever data structure contains the data it needs.

Why is TBMSharedInfo separate from TBMSharedIteratorState? It seems like you could merge those two into a single structure.

I think we can simplify things here a bit by having shared iterators not support single-page mode. In the backend-private case, tbm_begin_iterate() really doesn't need to do anything with the pagetable in the TBM_ONE_PAGE case, but for a shared iterator, we've got to anyway copy the pagetable into shared memory. So it seems to me that it would be simpler just to transform it into a standard iteration array while we're at it, instead of copying it into entry1. In other words, I suggest removing both "entry1" and "status" from TBMSharedInfo and making tbm_prepare_shared_iterate smarter to compensate.

I think "dsa_data" is poorly named; it should be something like "dsapagetable" in TIDBitmap and just "pagetable" in TBMSharedInfo. I think you should move the "base", "relpages", and "relchunks" into TBMSharedIterator and give them better names, like "ptbase", "ptpages" and "ptchunks". 
"base" isn't clear that we're talking about the pagetable's base as opposed to anything else, and "relpages" and "relchunks" are named based on the fact that the pointers are relative rather than named based on what data they point at, which doesn't seem like the right choice. I suggest putting the parallel functions next to each other in the file: tbm_begin_iterate(), tbm_prepare_shared_iterate(), tbm_iterate(), tbm_shared_iterate(), tbm_end_iterate(), tbm_end_shared_iterate(). +if (tbm->dsa == NULL) +return pfree(pointer); Don't try to return a value of type void. The correct spelling of this is { pfree(pointer); return; }. Formatted appropriately across 4 lines, of course. +/* + * If TBM is in iterating phase that means pagetable is already created + * and we have come here during tbm_free. By this time we are already + * detached from the DSA because the GatherNode would have been shutdown. + */ +if (tbm->iterating) +return; This seems like something of a kludge, although not a real easy one to fix. The problem here is that tidbitmap.c ideally shouldn't have to know that it's being used by the executor or that there's a Gather node involved, and certainly not the order of operations around shutdown. It should really be the problem of 0002 to handle this kind of problem, without 0001 having to know anything about it. It strikes me that it would probably be a good idea to have Gather clean up its children before it gets cleaned up itself. 
So in ExecShutdownNode, we could do something like this: diff --git a/src/backend/executor/execProcnode.c b/src/backend/executor/execProcnode.c index 0dd95c6..5ccc2e8 100644 --- a/src/backend/executor/execProcnode.c +++ b/src/backend/executor/execProcnode.c @@ -815,6 +815,8 @@ ExecShutdownNode(PlanState *node) if (node == NULL) return false; +planstate_tree_walker(node, ExecShutdownNode, NULL); + switch (nodeTag(node)) { case T_GatherState: @@ -824,5 +826,5 @@ ExecShutdownNode(PlanState *node) break; } -return planstate_tree_walker(node, ExecShutdownNode, NULL); +return false; } Also, in ExecEndGather, something like this: diff --git a/src/backend/executor/nodeGather.c b/src/backend/executor/nodeGather.c index a1a3561..32c97d3 100644 --- a/src/backend/executor/nodeGather.c +++ b/src/backend/executor/nodeGather.c @@ -229,10 +229,10 @@ ExecGather(GatherState *node) void ExecEndGather(GatherState *node) { +ExecEndNode(outerPlanState(node)); /* let children clean up first */ ExecShutdownGather(node); ExecFreeExprContext(&node->ps); ExecClearTuple(node->ps.ps_ResultTupleSlot); -ExecEndNode(outerPlanState(node)); } /* With those changes and an ExecShutdownBitmapHeapScan() called from ExecShutdownNode(), it should be possible (I think) for us to always have the bitmap heap scan node shut down before the Gather node shuts down, which I think would let you avoid having a special case for this inside the TBM code. +char *ptr; +dsaptr = dsa_allocate(tbm->dsa, size + sizeof(dsa_pointer)); +tbm->dsa_data = dsaptr; +ptr = dsa_get_address(tbm->dsa, dsaptr); +memset(ptr, 0, size + sizeof(ds
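The allocation snippet cut off above follows a recognizable pattern: reserve a little extra space in front of the usable memory and stash the allocation's own handle there, so that the free path can recover the dsa_pointer from a raw pointer alone. Here is a self-contained sketch of that layout, with plain malloc standing in for dsa_allocate and a uintptr_t standing in for dsa_pointer; all names are illustrative, not PostgreSQL's:

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* Stand-in for dsa_pointer: an opaque handle to the allocation. */
typedef uintptr_t fake_dsa_pointer;

/* Allocate size bytes plus room for the handle, zero everything, and
 * return a pointer just past the stored handle -- mirroring
 * dsa_allocate(tbm->dsa, size + sizeof(dsa_pointer)) followed by
 * memset(ptr, 0, ...). */
static void *
pagetable_allocate_sketch(size_t size)
{
    char *raw = malloc(size + sizeof(fake_dsa_pointer));
    fake_dsa_pointer handle = (fake_dsa_pointer) raw;   /* the "dsa_pointer" */

    memset(raw, 0, size + sizeof(fake_dsa_pointer));
    memcpy(raw, &handle, sizeof(handle));   /* stash handle in the header */
    return raw + sizeof(fake_dsa_pointer);  /* usable memory starts here */
}

/* Free path: step back over the header and recover the handle. */
static void
pagetable_free_sketch(void *ptr)
{
    char *raw = (char *) ptr - sizeof(fake_dsa_pointer);
    fake_dsa_pointer handle;

    memcpy(&handle, raw, sizeof(handle));
    free((void *) handle);      /* with DSA this would be dsa_free(handle) */
}
```

The point of the header is that the hash-table element allocator returns a plain pointer, but DSA can only free via the dsa_pointer, so the handle has to travel with the allocation.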
Re: [HACKERS] Parallel bitmap heap scan
On Tue, Feb 14, 2017 at 6:51 AM, Haribabu Kommi wrote: > +#if SIZEOF_DSA_POINTER == 4 > +typedef uint32 dsa_pointer; > +#else > +typedef uint64 dsa_pointer; > +#endif > > I feel the declaration of the above typdef can be moved into the > section above if we going with the current move into postgres.h > file. Robert has already posted a patch[1] for this, so nothing more needs to be done here. Therefore, I included dsa.h in tidbitmap.h. [1] https://www.postgresql.org/message-id/CA%2BTgmoZ%3DF%3DGkxV0YEv-A8tb%2BAEGy_Qa7GSiJ8deBKFATnzfEug%40mail.gmail.com > > +/* > + * tbm_alloc_shared > + * > + * Callback function for allocating the memory for hashtable elements. > + * It allocates memory from DSA if tbm holds a reference to a dsa. > + */ > +static inline void * > +pagetable_allocate(pagetable_hash *pagetable, Size size) > > Function name and comments mismatch? Fixed. -- Regards, Dilip Kumar EnterpriseDB: http://www.enterprisedb.com 0001-tidbitmap-support-shared-v2.patch Description: Binary data 0002-parallel-bitmap-heapscan-v2.patch Description: Binary data -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Parallel bitmap heap scan
On Mon, Feb 13, 2017 at 8:48 AM, Dilip Kumar wrote: > On Mon, Feb 13, 2017 at 6:24 PM, Robert Haas wrote: >> I don't think it's acceptable (or necessary) to move the DSA >> definitions into postgres.h. Why do you think you need to do that, >> vs. just including dsa.h in a few more places? > > I need to access dsa_pointer in tidbitmap.h, which is included from > FRONTEND as well. Now, problem is that dsa.h is including #include > "port/atomics.h", but atomic.h can not be included if FRONTEND is > defined. > > #ifndef ATOMICS_H > #define ATOMICS_H > #ifdef FRONTEND > #error "atomics.h may not be included from frontend code" > #endif > > Is there any other solution to this ? Well, any problem like this generally has a bunch of solutions, so I'll say yes. I spent a good chunk of today studying the issue and started a new thread devoted specifically to it: https://www.postgresql.org/message-id/CA%2BTgmoZ%3DF%3DGkxV0YEv-A8tb%2BAEGy_Qa7GSiJ8deBKFATnzfEug%40mail.gmail.com -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Re: [HACKERS] Parallel bitmap heap scan
On Tue, Feb 14, 2017 at 12:48 AM, Dilip Kumar wrote: > On Mon, Feb 13, 2017 at 6:24 PM, Robert Haas > wrote: > > I don't think it's acceptable (or necessary) to move the DSA > > definitions into postgres.h. Why do you think you need to do that, > > vs. just including dsa.h in a few more places? > > I need to access dsa_pointer in tidbitmap.h, which is included from > FRONTEND as well. Now, problem is that dsa.h is including #include > "port/atomics.h", but atomic.h can not be included if FRONTEND is > defined. > > #ifndef ATOMICS_H > #define ATOMICS_H > #ifdef FRONTEND > #error "atomics.h may not be included from frontend code" > #endif > > Is there any other solution to this ? How about creating another header file with the parallel changes and including it only in the necessary places? Following are my observations from going through the patch. +#if SIZEOF_DSA_POINTER == 4 +typedef uint32 dsa_pointer; +#else +typedef uint64 dsa_pointer; +#endif I feel the declaration of the above typedef can be moved into the section above if we are going with the current move into the postgres.h file. +/* + * tbm_alloc_shared + * + * Callback function for allocating the memory for hashtable elements. + * It allocates memory from DSA if tbm holds a reference to a dsa. + */ +static inline void * +pagetable_allocate(pagetable_hash *pagetable, Size size) Function name and comment mismatch? Regards, Hari Babu Fujitsu Australia
Re: [HACKERS] Parallel bitmap heap scan
On Mon, Feb 13, 2017 at 6:24 PM, Robert Haas wrote: > I don't think it's acceptable (or necessary) to move the DSA > definitions into postgres.h. Why do you think you need to do that, > vs. just including dsa.h in a few more places? I need to access dsa_pointer in tidbitmap.h, which is also included from FRONTEND code. The problem is that dsa.h includes "port/atomics.h", but atomics.h cannot be included if FRONTEND is defined: #ifndef ATOMICS_H #define ATOMICS_H #ifdef FRONTEND #error "atomics.h may not be included from frontend code" #endif Is there any other solution to this? -- Regards, Dilip Kumar EnterpriseDB: http://www.enterprisedb.com
Re: [HACKERS] Parallel bitmap heap scan
On Sat, Feb 11, 2017 at 1:41 AM, Dilip Kumar wrote: > I tried my best, please check the latest patch (0001). I don't think it's acceptable (or necessary) to move the DSA definitions into postgres.h. Why do you think you need to do that, vs. just including dsa.h in a few more places? -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Re: [HACKERS] Parallel bitmap heap scan
On Wed, Feb 8, 2017 at 8:07 AM, Robert Haas wrote: Thanks for the detailed review and your valuable feedback. > I think it would be useful to break the remaining patch into two > parts, one that enhances the tidbitmap.c API and another that uses > that to implement Parallel Bitmap Heap Scan. I have done that. > There are several cosmetic problems here. You may have noticed that > all function prototypes in PostgreSQL header files are explicitly > declared extern; yours should be, too. Also, there is extra > whitespace before some of the variable names here, like > "ParallelTBMInfo * tbminfo" instead of "ParallelTBMInfo *tbminfo". If > that is due to pgindent, the solution is to add the typedef names to > your local typedef list. Also, tbm_restore_shared_members doesn't > actually exist. Fixed > 1. Add an additional argument to tbm_create(), dsa_area *dsa. If it's > NULL, we allocate a backend-private memory for the hash entries as > normal. If it's true, we use the dsa_area to store the hash entries, > using the infrastructure added by your 0002 and revised in > c3c4f6e1740be256178cd7551d5b8a7677159b74. (You can use a flag in the > BitmapIndexScan and BitmapOr to decide whether to pass NULL or > es_query_dsa.) Done > > 2. Add a new function dsa_pointer tbm_prepare_shared_iterate(TIDBitmap > *tbm) which allocates shared iteration arrays using the DSA passed to > tbm_create() and returns a dsa_pointer to the result. Arrange this so > that if it's called more than once, each call returns a separate > iterator, so that you can call it once to get the main iterator and a > second time for the prefetch iterator, but have both of those point to > the same underlying iteration arrays. > > 3. Add a new function TBMSharedIterator > *tbm_attach_shared_iterate(dsa_area *dsa, dsa_pointer) which is called > once per backend and gets passed the dsa_pointer from the previous > step. > > 4. 
Add a new function TBMIterateResult > *tbm_shared_iterate(TBMSharedIterator *) to fetch the next result. I have implemented these three APIs as you explained, with one change: in tbm_attach_shared_iterate I also need to pass the TBM, because each worker creates its own copy of the TBM and needs to get the TBM-related information from the shared location. Some of that information (like nchunks and npages) can be read directly from the shared location, but other information (like the base pointer to the DSA element, RelptrPagetableEntry, etc.) must be local to each worker (after conversion from a dsa_pointer to a local pointer). > > As compared with your proposal, this involves only 3 functions instead > of 7 (plus one new argument to another function), and I think it's a > lot clearer what those functions are actually doing. You don't need > tbm_estimate_parallel_tbminfo() any more because the data being passed > from one backend to another is precisely a dsa_pointer, and the bitmap > scan can just leave space for that in its own estimator function. You > don't need tbm_update_shared_members() separately from > tbm_begin_shared_iterate() separately from tbm_init_shared_iterator() > because tbm_prepare_shared_iterate() can do all that stuff. You don't > need tbm_set_parallel() because the additional argument to > tbm_create() takes care of that need. Right > > The way you've got ParallelTBMInfo set up right now doesn't respect > the separation of concerns between nodeBitmapHeapscan.c and > tidbitmap.c properly. tidbitmap.c shouldn't know that the caller > intends on creating two iterators, one of which is for prefetching. > The design above hopefully addresses that: each call to > tbm_prepare_shared_iterate returns a dsa_pointer to a separate chunk > of shared iterator state, but tidbitmap.c does not need to know how > many times that will be called. 
Done > > Apart from the above, this patch will need a rebase over today's > commits, Done > and please make sure all functions you add have high-quality > header comments. I tried my best, please check the latest patch (0001). Apart from those, there are some more changes. 1. I have moved the dsa_pointer and dsa_area declarations from dsa.h to postgres.h; an independent patch is attached for this. We need to declare function prototypes with dsa_pointer in tidbitmap.h, but tidbitmap.h is also used from FRONTEND code and therefore cannot include dsa.h (as it contains atomics.h). 2. I noticed a defect in my code related to handling TBM_ONE_PAGE. The initial version had no problem, but it got introduced somewhere in the intermediate versions; I have fixed it. There were two options: 1. Don't switch to TBM_ONE_PAGE in parallel mode (ideally this cannot happen; only a bad estimate can get us there) and go directly to TBM_HASH. 2. Keep space in the shared information for sharing a PagetableEntry. I implemented option 2 (the initial versions implemented option 1). Note: the patch is also rebased on top of guc_parallel_index
Re: [HACKERS] Parallel bitmap heap scan
On Wed, Feb 8, 2017 at 10:59 AM, Robert Haas wrote: > Looks good to me. If nobody has further ideas here, I'll push this > and your previous patch tomorrow. Done. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Re: [HACKERS] Parallel bitmap heap scan
On Wed, Feb 8, 2017 at 8:58 AM, Dilip Kumar wrote: > On Wed, Feb 8, 2017 at 7:01 PM, Robert Haas wrote: >> You can store whatever you want in SH_TYPE's private_data member. >> SH_ALLOCATE and SH_FREE both get a pointer to the SH_TYPE, so they >> have access to that. Hmm, but there's no way to get that set in >> SH_CREATE before SH_ALLOCATE is called. Maybe we need to add a >> private_data argument to SH_CREATE. execGrouping.c could use that >> instead of frobbing private_data directly: >> >> -hashtable->hashtab = tuplehash_create(tablecxt, nbuckets); >> -hashtable->hashtab->private_data = hashtable; >> +hashtable->hashtab = tuplehash_create(tablecxt, nbuckets, hashtable); > > Okay, will go ahead as you suggested. Patch attached for the same. Looks good to me. If nobody has further ideas here, I'll push this and your previous patch tomorrow. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Re: [HACKERS] Parallel bitmap heap scan
On Wed, Feb 8, 2017 at 7:01 PM, Robert Haas wrote: > You can store whatever you want in SH_TYPE's private_data member. > SH_ALLOCATE and SH_FREE both get a pointer to the SH_TYPE, so they > have access to that. Hmm, but there's no way to get that set in > SH_CREATE before SH_ALLOCATE is called. Maybe we need to add a > private_data argument to SH_CREATE. execGrouping.c could use that > instead of frobbing private_data directly: > > -hashtable->hashtab = tuplehash_create(tablecxt, nbuckets); > -hashtable->hashtab->private_data = hashtable; > +hashtable->hashtab = tuplehash_create(tablecxt, nbuckets, hashtable); Okay, will go ahead as you suggested. Patch attached for the same. -- Regards, Dilip Kumar EnterpriseDB: http://www.enterprisedb.com hash_create_fix.patch Description: Binary data
Re: [HACKERS] Parallel bitmap heap scan
On Wed, Feb 8, 2017 at 1:59 AM, Dilip Kumar wrote: > IIUC, tbm_prepare_shared_iterate will be called only by the leader, > for tbmiterator as well as for the prefetch_iterator. And, > tbm_attach_shared_iterate will be called by each backend and for both > the iterators. That's what I had in mind. > IMHO, tbm_attach_shared_iterate should return TBMIterator with > reference to TBMSharedIterator inside it. The reason behind same is > that we can not keep TBMIterateResult inside TBMSharedIterator > otherwise, results will also go in shared memory but we want to have > local result memory for each worker so that other worker doesn't > disturb it. No, I don't agree. I think TBMSharedIterator should be an unshared structure created by tbm_attach_shared_iterate, which can internally contain backend-private state like a TBMIterateResult, and which can also contain a pointer to the shared-memory structure previously created by tbm_prepare_shared_iterate. That thing needs to be called something other than a TBMSharedIterator, like TBMSharedIterationState or something. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Re: [HACKERS] Parallel bitmap heap scan
On Wed, Feb 8, 2017 at 4:20 AM, Dilip Kumar wrote: > The new SH_CREATE(MemoryContext ctx, uint32 nelements) don't have any > option to supply arguments to it. Our callback functions need access > to TBM. > > Is it expected that if the user of SH_CREATE who doesn't want to pass > a "MemoryContext" then we can pass arguments instead of ctx? You can store whatever you want in SH_TYPE's private_data member. SH_ALLOCATE and SH_FREE both get a pointer to the SH_TYPE, so they have access to that. Hmm, but there's no way to get that set in SH_CREATE before SH_ALLOCATE is called. Maybe we need to add a private_data argument to SH_CREATE. execGrouping.c could use that instead of frobbing private_data directly: -hashtable->hashtab = tuplehash_create(tablecxt, nbuckets); -hashtable->hashtab->private_data = hashtable; +hashtable->hashtab = tuplehash_create(tablecxt, nbuckets, hashtable); -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
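The shape being proposed here can be sketched outside simplehash with a toy create/allocate pair. Everything below (names, struct layout) is illustrative, assuming only that the table stores an opaque private_data pointer which the allocator callback can read:

```c
#include <assert.h>
#include <stdlib.h>

/* Toy analogue of SH_TYPE: carries the opaque context. */
typedef struct toytab
{
    void   *private_data;
    void   *data;
} toytab;

/* Analogue of the owning structure (e.g. a TupleHashTable or TIDBitmap). */
typedef struct owner
{
    int alloc_calls;    /* counts calls into the allocator */
} owner;

/* Allocator callback reaches its owner through private_data,
 * the way pagetable_allocate would reach the owning TIDBitmap. */
static void *
toytab_allocate(toytab *tab, size_t size)
{
    owner *own = (owner *) tab->private_data;

    own->alloc_calls++;
    return calloc(1, size);
}

/* Taking private_data as a create() argument means it is already set
 * when the initial allocation happens -- the problem with setting it
 * only after tuplehash_create() has returned. */
static toytab *
toytab_create(size_t nelements, void *private_data)
{
    toytab *tab = calloc(1, sizeof(toytab));

    tab->private_data = private_data;
    tab->data = toytab_allocate(tab, nelements * sizeof(int));
    return tab;
}
```

The design point is ordering: the table's very first allocation already goes through the callback, so any context the callback needs must be installed before, not after, create() runs.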
Re: [HACKERS] Parallel bitmap heap scan
On Wed, Feb 8, 2017 at 5:21 AM, Dilip Kumar wrote: > On Wed, Feb 8, 2017 at 3:44 AM, Robert Haas wrote: +#ifndef SH_USE_NONDEFAULT_ALLOCATOR + >>> >>> That should probably be documented in the file header. >> >> Right. OK, did that and a few other cleanups, and committed. > > I think we need to have prototype for the default allocator outside of > #ifndef SH_USE_NONDEFAULT_ALLOCATOR. Because the file e.g. tidbitmap.c > who wants to use SH_USE_NONDEFAULT_ALLOCATOR will provide the > allocator function definition but it can not have the declaration of > those function as that function take SH_TYPE as input and that will be > only defined once we include the simplehash.h. > > So basically we can not declare prototype before including > simplehash.h for allocator. And, if we don't declare we will get > "implicit declaration warning" because simplehash itself is using > those functions. > > The solution is simplehash.h, should always declare it, and provide > the definitions only if SH_USE_NONDEFAULT_ALLOCATOR is not defined. > Attached patch does that. Makes sense, will commit. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Re: [HACKERS] Parallel bitmap heap scan
On Wed, Feb 8, 2017 at 3:44 AM, Robert Haas wrote: >>> +#ifndef SH_USE_NONDEFAULT_ALLOCATOR >>> + >> >> That should probably be documented in the file header. > > Right. OK, did that and a few other cleanups, and committed. I think we need to have the prototypes for the default allocator outside of #ifndef SH_USE_NONDEFAULT_ALLOCATOR. A file such as tidbitmap.c that wants to use SH_USE_NONDEFAULT_ALLOCATOR will provide the allocator function definitions, but it cannot declare those functions itself, because they take SH_TYPE as input and that type is only defined once simplehash.h is included. So basically we cannot declare the allocator prototypes before including simplehash.h, and if we don't declare them we get an "implicit declaration" warning because simplehash itself uses those functions. The solution is that simplehash.h should always declare them, and provide the definitions only if SH_USE_NONDEFAULT_ALLOCATOR is not defined. The attached patch does that. -- Regards, Dilip Kumar EnterpriseDB: http://www.enterprisedb.com hash_allocate_fix.patch Description: Binary data
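The fix described above amounts to a standard header trick: declare unconditionally, define conditionally. A minimal standalone illustration, with generic names rather than the actual simplehash macros:

```c
#include <assert.h>
#include <stdlib.h>

/* --- what the template header would emit --- */

/* Prototypes are emitted unconditionally, so template code can call
 * the allocator whether or not the includer supplies its own. */
static void *tab_alloc(size_t size);
static void  tab_free(void *ptr);

/* Definitions are emitted only when the includer did NOT opt out. */
#ifndef TAB_USE_NONDEFAULT_ALLOCATOR
static void *
tab_alloc(size_t size)
{
    return calloc(1, size);
}

static void
tab_free(void *ptr)
{
    free(ptr);
}
#endif

/* Template code can now reference the allocator with no
 * implicit-declaration warning in either configuration. */
static void *
tab_create(size_t nbytes)
{
    return tab_alloc(nbytes);
}
```

With TAB_USE_NONDEFAULT_ALLOCATOR defined, the includer supplies its own tab_alloc/tab_free bodies after the include, and the always-present prototypes keep the template's internal calls legal in the meantime.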
Re: [HACKERS] Parallel bitmap heap scan
On Wed, Feb 8, 2017 at 3:44 AM, Robert Haas wrote: >>> +#ifndef SH_USE_NONDEFAULT_ALLOCATOR >>> + >> >> That should probably be documented in the file header. > > Right. OK, did that and a few other cleanups, and committed. The new SH_CREATE(MemoryContext ctx, uint32 nelements) doesn't have any option to supply arguments to it, and our callback functions need access to the TBM. Is it expected that a user of SH_CREATE who doesn't want to pass a MemoryContext can pass an argument in place of ctx? Something like this? if (!tbm->dsa) tbm->pagetable = pagetable_create(tbm->mcxt, 128); else tbm->pagetable = pagetable_create((MemoryContext)tbm, 128); And in the allocation function, we can access this context and typecast it to the tbm, as shown below. static void * pagetable_allocate(pagetable_hash *pagetable, Size size) { TIDBitmap *tbm = pagetable->ctx; So is it expected to work as I explained above, or are we missing an argument parameter to SH_CREATE (and in the SH_TYPE structure), or do you have some other way in mind? -- Regards, Dilip Kumar EnterpriseDB: http://www.enterprisedb.com
Re: [HACKERS] Parallel bitmap heap scan
On Wed, Feb 8, 2017 at 8:07 AM, Robert Haas wrote: Thanks for your input; I have a few queries about these comments. > 2. Add a new function dsa_pointer tbm_prepare_shared_iterate(TIDBitmap > *tbm) which allocates shared iteration arrays using the DSA passed to > tbm_create() and returns a dsa_pointer to the result. Arrange this so > that if it's called more than once, each call returns a separate > iterator, so that you can call it once to get the main iterator and a > second time for the prefetch iterator, but have both of those point to > the same underlying iteration arrays. > > 3. Add a new function TBMSharedIterator > *tbm_attach_shared_iterate(dsa_area *dsa, dsa_pointer) which is called > once per backend and gets passed the dsa_pointer from the previous > step. IIUC, tbm_prepare_shared_iterate will be called only by the leader, for the tbmiterator as well as for the prefetch_iterator, and tbm_attach_shared_iterate will be called by each backend, for both iterators. IMHO, tbm_attach_shared_iterate should return a TBMIterator with a reference to a TBMSharedIterator inside it. The reason is that we cannot keep the TBMIterateResult inside the TBMSharedIterator; otherwise the results would also go into shared memory, but we want local result memory for each worker so that other workers don't disturb it. Another option would be to change tbm_shared_iterate as follows: TBMIterateResult * tbm_shared_iterate(TBMSharedIterator *, TBMIterateResult *result). If the result passed to this API is NULL, it allocates memory for the result (and that way we have local result memory); if it's not NULL, we use that memory to store our results. BitmapHeapNode already has a reference to the TBMIterateResult, so we should have no problem passing that reference to tbm_shared_iterate. I think this option looks better than the one I explained above. Please suggest. > > 4. 
Add a new function TBMIterateResult > *tbm_shared_iterate(TBMSharedIterator *) to fetch the next result. -- Regards, Dilip Kumar EnterpriseDB: http://www.enterprisedb.com
Re: [HACKERS] Parallel bitmap heap scan
On Sun, Feb 5, 2017 at 9:48 AM, Dilip Kumar wrote: > Here are the latest version of the patch, which address all the > comments given by Robert. I think it would be useful to break the remaining patch into two parts, one that enhances the tidbitmap.c API and another that uses that to implement Parallel Bitmap Heap Scan. In this review I'm going to focus on the changes to tidbitmap.c and tidbitmap.h. +extern void tbm_restore_shared_members(TIDBitmap *tbm, + ParallelTBMInfo * stbm); +extern void tbm_update_shared_members(TIDBitmap *tbm, + ParallelTBMInfo * parallel_tbm); +voidtbm_set_parallel(TIDBitmap *tbm, void *area); +extern Size tbm_estimate_parallel_tbminfo(Size size); +TIDBitmap *tbm_attach(ParallelTBMInfo * parallel_tbm, void *area); +TBMIterator *tbm_begin_shared_iterate(TIDBitmap *tbm, + ParallelTBMInfo * parallel_tbm, bool prefetch); +TBMIterateResult *tbm_shared_iterate(TBMIterator *iterator); +voidtbm_init_shared_iterator(ParallelTBMInfo * tbminfo); There are several cosmetic problems here. You may have noticed that all function prototypes in PostgreSQL header files are explicitly declared extern; yours should be, too. Also, there is extra whitespace before some of the variable names here, like "ParallelTBMInfo * tbminfo" instead of "ParallelTBMInfo *tbminfo". If that is due to pgindent, the solution is to add the typedef names to your local typedef list. Also, tbm_restore_shared_members doesn't actually exist. More broadly, this seems like an extremely complicated API. Even ignoring the function that doesn't exist, that's still 7 different functions just for shared iteration, which seems like a lot. I suggest the following API: 1. Add an additional argument to tbm_create(), dsa_area *dsa. If it's NULL, we allocate a backend-private memory for the hash entries as normal. If it's true, we use the dsa_area to store the hash entries, using the infrastructure added by your 0002 and revised in c3c4f6e1740be256178cd7551d5b8a7677159b74. 
(You can use a flag in the BitmapIndexScan and BitmapOr to decide whether to pass NULL or es_query_dsa.) 2. Add a new function dsa_pointer tbm_prepare_shared_iterate(TIDBitmap *tbm) which allocates shared iteration arrays using the DSA passed to tbm_create() and returns a dsa_pointer to the result. Arrange this so that if it's called more than once, each call returns a separate iterator, so that you can call it once to get the main iterator and a second time for the prefetch iterator, but have both of those point to the same underlying iteration arrays. 3. Add a new function TBMSharedIterator *tbm_attach_shared_iterate(dsa_area *dsa, dsa_pointer) which is called once per backend and gets passed the dsa_pointer from the previous step. 4. Add a new function TBMIterateResult *tbm_shared_iterate(TBMSharedIterator *) to fetch the next result. As compared with your proposal, this involves only 3 functions instead of 7 (plus one new argument to another function), and I think it's a lot clearer what those functions are actually doing. You don't need tbm_estimate_parallel_tbminfo() any more because the data being passed from one backend to another is precisely a dsa_pointer, and the bitmap scan can just leave space for that in its own estimator function. You don't need tbm_update_shared_members() separately from tbm_begin_shared_iterate() separately from tbm_init_shared_iterator() because tbm_prepare_shared_iterate() can do all that stuff. You don't need tbm_set_parallel() because the additional argument to tbm_create() takes care of that need. The way you've got ParallelTBMInfo set up right now doesn't respect the separation of concerns between nodeBitmapHeapscan.c and tidbitmap.c properly. tidbitmap.c shouldn't know that the caller intends on creating two iterators, one of which is for prefetching. 
The design above hopefully addresses that: each call to tbm_prepare_shared_iterate returns a dsa_pointer to a separate chunk of shared iterator state, but tidbitmap.c does not need to know how many times that will be called. Apart from the above, this patch will need a rebase over today's commits, and please make sure all functions you add have high-quality header comments. Thanks, -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
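The three-function API proposed above can be mimicked with a toy shared cursor: one shared iteration array (step 2), per-backend handles attached to it (step 3), and an iterate call that advances the shared cursor while keeping the result backend-local (step 4). This is only a single-process sketch with invented names; real code would place the state in DSA and serialize the cursor bump with a lock:

```c
#include <assert.h>
#include <stdlib.h>

/* Shared state, as tbm_prepare_shared_iterate() would build in DSA. */
typedef struct SharedIterState
{
    int     next;         /* shared cursor into the iteration array */
    int     npages;
    int     pages[16];    /* flattened "iteration array" of block numbers */
} SharedIterState;

/* Backend-private handle, as tbm_attach_shared_iterate() would return;
 * the result lives here so workers don't clobber each other's results. */
typedef struct SharedIterator
{
    SharedIterState *state;
    int              result;   /* analogue of a TBMIterateResult */
} SharedIterator;

/* Leader: flatten the pages into a shared array once. */
static SharedIterState *
prepare_shared_iterate(const int *pages, int npages)
{
    SharedIterState *st = calloc(1, sizeof(SharedIterState));

    for (int i = 0; i < npages; i++)
        st->pages[i] = pages[i];
    st->npages = npages;
    return st;
}

/* Each backend: make a private handle pointing at the shared state. */
static SharedIterator *
attach_shared_iterate(SharedIterState *st)
{
    SharedIterator *it = calloc(1, sizeof(SharedIterator));

    it->state = st;
    return it;
}

/* Returns a pointer to the backend-local result, or NULL when done.
 * In real code the cursor advance would happen under an LWLock. */
static int *
shared_iterate(SharedIterator *it)
{
    SharedIterState *st = it->state;

    if (st->next >= st->npages)
        return NULL;
    it->result = st->pages[st->next++];
    return &it->result;
}
```

Two handles attached to the same state hand out each page exactly once between them, which is the property the parallel scan needs; calling prepare twice (for the prefetch iterator) would simply yield a second cursor over the same array.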
Re: [HACKERS] Parallel bitmap heap scan
On Tue, Feb 7, 2017 at 4:45 PM, Andres Freund wrote: > On 2017-02-07 16:36:55 -0500, Robert Haas wrote: >> On Tue, Feb 7, 2017 at 4:15 PM, Andres Freund wrote: >> > FWIW, I think it'd have been better to not add the new callbacks as >> > parameters to *_create(), but rather have them be "templatized" like the >> > rest of simplehash. That'd require that callback to check the context, >> > to know whether it should use shared memory or not, but that seems fine >> > to me. Right now this pushes the head of simplehash above a >> > cacheline... >> >> Something like the attached? > > Yes. > >> +#ifndef SH_USE_NONDEFAULT_ALLOCATOR >> + > > That should probably be documented in the file header. Right. OK, did that and a few other cleanups, and committed. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Re: [HACKERS] Parallel bitmap heap scan
On 2017-02-07 16:36:55 -0500, Robert Haas wrote: > On Tue, Feb 7, 2017 at 4:15 PM, Andres Freund wrote: > > FWIW, I think it'd have been better to not add the new callbacks as > > parameters to *_create(), but rather have them be "templatized" like the > > rest of simplehash. That'd require that callback to check the context, > > to know whether it should use shared memory or not, but that seems fine > > to me. Right now this pushes the head of simplehash above a > > cacheline... > > Something like the attached? Yes. > +#ifndef SH_USE_NONDEFAULT_ALLOCATOR > + That should probably be documented in the file header. Thanks! Andres
Re: [HACKERS] Parallel bitmap heap scan
On Tue, Feb 7, 2017 at 4:15 PM, Andres Freund wrote: > FWIW, I think it'd have been better to not add the new callbacks as > parameters to *_create(), but rather have them be "templatized" like the > rest of simplehash. That'd require that callback to check the context, > to know whether it should use shared memory or not, but that seems fine > to me. Right now this pushes the head of simplehash above a > cacheline... Something like the attached? -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company rework-simplehash-allocator.patch Description: Binary data
Re: [HACKERS] Parallel bitmap heap scan
On Tue, Feb 7, 2017 at 4:13 PM, Jeff Janes wrote: > I'm getting compiler errors: > > In file included from execGrouping.c:47: > ../../../src/include/lib/simplehash.h:91: error: redefinition of typedef > 'simplehash_allocate' > ../../../src/include/lib/simplehash.h:91: note: previous declaration of > 'simplehash_allocate' was here > ../../../src/include/lib/simplehash.h:92: error: redefinition of typedef > 'simplehash_free' > ../../../src/include/lib/simplehash.h:92: note: previous declaration of > 'simplehash_free' was here Thanks, I'll stick an #ifdef guard around that. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Re: [HACKERS] Parallel bitmap heap scan
On 2017-02-07 13:13:43 -0800, Jeff Janes wrote: > On Tue, Feb 7, 2017 at 1:03 PM, Robert Haas wrote: > > > On Sun, Feb 5, 2017 at 9:48 AM, Dilip Kumar wrote: > > >> I think maybe we should rename the functions to element_allocate, > > >> element_free, and element_allocator_ctx, or something like that. The > > >> current names aren't capitalized consistently with other things in > > >> this module, and putting the word "element" in there would make it > > >> more clear what the purpose of this thing is. > > > > > > Fixed as per the suggestion > > > > Eh, not really. You changed the memory context to be called > > element_allocator_ctx, rather than changing the args passed to the > > element allocator to have that name, which is what I had in mind. > > > > I did some assorted renaming and other cosmetic improvements and committed > > 0002. > > > > I'm getting compiler errors: > > In file included from execGrouping.c:47: > ../../../src/include/lib/simplehash.h:91: error: redefinition of typedef > 'simplehash_allocate' > ../../../src/include/lib/simplehash.h:91: note: previous declaration of > 'simplehash_allocate' was here > ../../../src/include/lib/simplehash.h:92: error: redefinition of typedef > 'simplehash_free' > ../../../src/include/lib/simplehash.h:92: note: previous declaration of > 'simplehash_free' was here Oh yea, that too - you can't just redefine a typedef like that according to C89 :(
Re: [HACKERS] Parallel bitmap heap scan
On 2017-02-07 16:03:43 -0500, Robert Haas wrote: > On Sun, Feb 5, 2017 at 9:48 AM, Dilip Kumar wrote: > >> I think maybe we should rename the functions to element_allocate, > >> element_free, and element_allocator_ctx, or something like that. The > >> current names aren't capitalized consistently with other things in > >> this module, and putting the word "element" in there would make it > >> more clear what the purpose of this thing is. > > > > Fixed as per the suggestion > > Eh, not really. You changed the memory context to be called > element_allocator_ctx, rather than changing the args passed to the > element allocator to have that name, which is what I had in mind. > > I did some assorted renaming and other cosmetic improvements and committed > 0002. FWIW, I think it'd have been better to not add the new callbacks as parameters to *_create(), but rather have them be "templatized" like the rest of simplehash. That'd require that callback to check the context, to know whether it should use shared memory or not, but that seems fine to me. Right now this pushes the head of simplehash above a cacheline... I'm also doubtful about the naming of the default callbacks: +/* default memory allocator function */ +static void * +SH_DEFAULT_ALLOC(Size size, void *args) +{ + MemoryContext context = (MemoryContext) args; + + return MemoryContextAllocExtended(context, size, + MCXT_ALLOC_HUGE | MCXT_ALLOC_ZERO); +} SH_DEFAULT_ALLOC sounds like it's a name that's one of the prefixed names (like SH_CREATE actually becomes pagetable_create and such) - but afaics it's not. Which afaics means that this'll generate symbol conflicts if one translation unit uses multiple simplehash.h style hashes. Afaics these should either be prefixed, or static inline functions. - Andres -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
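For reference, the prefixing Andres is asking for works by token-pasting SH_PREFIX onto every exported name, so two simplehash instantiations in one translation unit produce distinct symbols. A stand-alone sketch of that mechanism — the paste macros imitate simplehash.h's name mangling, and the marker functions are illustrative, not real simplehash callbacks:

```c
/* Indirection so SH_PREFIX is expanded before pasting. */
#define SH_PASTE2(a, b) a##_##b
#define SH_PASTE(a, b) SH_PASTE2(a, b)
#define SH_MAKE_NAME(name) SH_PASTE(SH_PREFIX, name)

/* First "instantiation": every name gets the pagetable_ prefix. */
#define SH_PREFIX pagetable
int
SH_MAKE_NAME(default_alloc_marker) (void)	/* -> pagetable_default_alloc_marker */
{
	return 1;
}
#undef SH_PREFIX

/* Second "instantiation" in the same translation unit: no collision,
 * because the generated name is tuplehash_default_alloc_marker. */
#define SH_PREFIX tuplehash
int
SH_MAKE_NAME(default_alloc_marker) (void)
{
	return 2;
}
#undef SH_PREFIX
```

An unprefixed SH_DEFAULT_ALLOC, by contrast, would be defined once per instantiation with the same symbol name, which is exactly the conflict Andres describes.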
Re: [HACKERS] Parallel bitmap heap scan
On Tue, Feb 7, 2017 at 1:03 PM, Robert Haas wrote: > On Sun, Feb 5, 2017 at 9:48 AM, Dilip Kumar wrote: > >> I think maybe we should rename the functions to element_allocate, > >> element_free, and element_allocator_ctx, or something like that. The > >> current names aren't capitalized consistently with other things in > >> this module, and putting the word "element" in there would make it > >> more clear what the purpose of this thing is. > > > > Fixed as per the suggestion > > Eh, not really. You changed the memory context to be called > element_allocator_ctx, rather than changing the args passed to the > element allocator to have that name, which is what I had in mind. > > I did some assorted renaming and other cosmetic improvements and committed > 0002. > I'm getting compiler errors: In file included from execGrouping.c:47: ../../../src/include/lib/simplehash.h:91: error: redefinition of typedef 'simplehash_allocate' ../../../src/include/lib/simplehash.h:91: note: previous declaration of 'simplehash_allocate' was here ../../../src/include/lib/simplehash.h:92: error: redefinition of typedef 'simplehash_free' ../../../src/include/lib/simplehash.h:92: note: previous declaration of 'simplehash_free' was here gcc version 4.4.7 20120313 (Red Hat 4.4.7-17) (GCC) Cheers, Jeff
Re: [HACKERS] Parallel bitmap heap scan
On Sun, Feb 5, 2017 at 9:48 AM, Dilip Kumar wrote: >> I think maybe we should rename the functions to element_allocate, >> element_free, and element_allocator_ctx, or something like that. The >> current names aren't capitalized consistently with other things in >> this module, and putting the word "element" in there would make it >> more clear what the purpose of this thing is. > > Fixed as per the suggestion Eh, not really. You changed the memory context to be called element_allocator_ctx, rather than changing the args passed to the element allocator to have that name, which is what I had in mind. I did some assorted renaming and other cosmetic improvements and committed 0002. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Parallel bitmap heap scan
On Wed, Feb 1, 2017 at 11:09 PM, Robert Haas wrote: Here is the latest version of the patch, which addresses all the comments given by Robert. > https://www.postgresql.org/message-id/20161017201558.cr34stc6zvxy3...@alap3.anarazel.de > that you should try to share only the iteration arrays, I believe that > he meant the results of tbm_begin_iterate() -- that is, > iterator->spageptr, chunkptr, schunkbit, spages, and schunks. I'm > perhaps putting words into his mouth, but what he said was that we > could avoid sharing "the whole underlying hash". But the patch set > you've implemented here doesn't behave that way. Instead, you've got > the array containing the hash table elements shared (which is good) > plus there's a sort of dummy hash table in every worker which copies > some but not all of the members of the original hash table, leading to > the otherwise-unnecessary if-test that Haribabu is complaining about. > So the hash table is kinda-shared-kinda-not-shared, which I don't > *think* is really what Andres had in mind. > > In short, I think SH_COPY (implemented here as pagetable_copy) needs > to go away. The work of tbm_begin_iterate() should be done before we > begin the shared scan and the results of that work should be shared. I have removed SH_COPY, and now the leader performs tbm_begin_shared_iterate before waking up the workers. Basically, the leader creates the page and chunk arrays, each an array of "relptr" (offlist, suggested by Robert). > What happens right now (it appears) is that every backend does that > work based on the same hash table and we just assume they all get the > same answer. And we really do assume that, because > pbms_parallel_iterate() assumes it can shuttle private state back and > forth between iterator in different backends and nothing will break; > but those backends aren't actually using the same iteration arrays. 
> They are using different iteration arrays that should have the same > contents because they were all derived from the same semi-shared hash > table. That's pretty fragile, and a waste of CPU cycles if the hash > table is large (every backend does its own sort). > > On a related note, I think it's unacceptable to make the internal > details of TBMIterator public. You've moved it from tidbitmap.c to > tidbitmap.h so that nodeBitmapHeapScan.c can tinker with the guts of > the TBMIterator, but that's not OK. Those details need to remain > private to tidbitmap.c. Fixed. > pbms_parallel_iterate() is a nasty kludge; we > need some better solution. The knowledge of how a shared iterator > should iterate needs to be private to tidbitmap.c, not off in the > executor someplace. And I think the entries need to actually be > updated in shared memory directly, not copied back and forth between a > backend-private iterator and a shared iterator. I have fixed this: there is now a new function called tbm_shared_iterate, which iterates directly using the shared iterator, so there is no longer any need to copy members back and forth between the local and shared iterators. > > Also, pbms_parallel_iterate() can't hold a spinlock around a call to > tbm_iterate(). Note the coding rules for spinlocks mentioned in > spin.h and src/backend/storage/lmgr/README. I think the right thing > to do here is to use an LWLock in a new tranche (add an entry to > BuiltinTrancheIds). Done that way. > > In 0002, you have this: > > +tb->alloc = palloc(sizeof(SH_ALLOCATOR)); > > This should be using MemoryContextAlloc(ctx, ...) rather than palloc. > Otherwise the allocator object is in a different context from > everything else in the hash table. But TBH, I don't really see why we > want this to be a separate object. Why not just put > HashAlloc/HashFree/args into SH_TYPE directly? That avoids some > pointer chasing and doesn't really seem to cost anything (except that > SH_CREATE will grow a slightly longer argument sequence). 
Done as suggested > > I think maybe we should rename the functions to element_allocate, > element_free, and element_allocator_ctx, or something like that. The > current names aren't capitalized consistently with other things in > this module, and putting the word "element" in there would make it > more clear what the purpose of this thing is. Fixed as per the suggestion -- Regards, Dilip Kumar EnterpriseDB: http://www.enterprisedb.com 0002-hash-support-alloc-free-v16.patch Description: Binary data 0003-parallel-bitmap-heap-scan-v16.patch Description: Binary data -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
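The "relptr" idea mentioned above can be sketched stand-alone: instead of raw pointers (which are only valid in the backend that mapped the segment), the shared iteration arrays store byte offsets from the segment base, and each backend resolves an offset against its own mapping. This is a simplified illustration with made-up names, not the actual relptr.h machinery:

```c
#include <stddef.h>
#include <string.h>

typedef size_t relptr_off;		/* offset within the shared segment */

/* Turn a pointer inside the segment into a base-relative offset. */
static relptr_off
relptr_store(char *base, void *ptr)
{
	return (relptr_off) ((char *) ptr - base);
}

/* Resolve an offset against this process's own mapping of the segment. */
static void *
relptr_access(char *base, relptr_off off)
{
	return base + off;
}

int
relptr_demo(void)
{
	char		segment[128];	/* stand-in for a DSA segment */
	char	   *entry = segment + 40;	/* a "hash entry" inside it */
	relptr_off	off;

	strcpy(entry, "page-42");
	off = relptr_store(segment, entry);

	/* A backend that mapped the segment at a different address would
	 * resolve the same offset, against its own base, to the same entry. */
	return strcmp((char *) relptr_access(segment, off), "page-42") == 0;
}
```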
Re: [HACKERS] Parallel bitmap heap scan
On Wed, Feb 1, 2017 at 11:09 PM, Robert Haas wrote: Hi Robert, Thanks for the review. > When Andres wrote in > https://www.postgresql.org/message-id/20161017201558.cr34stc6zvxy3...@alap3.anarazel.de > that you should try to share only the iteration arrays, I believe that > he meant the results of tbm_begin_iterate() -- that is, > iterator->spageptr, chunkptr, schunkbit, spages, and schunks. I'm > perhaps putting words into his mouth, but what he said was that we > could avoid sharing "the whole underlying hash". But the patch set > you've implemented here doesn't behave that way. Instead, you've got > the array containing the hash table elements shared (which is good) Here is my analysis of why we selected this instead of just sharing the iterator: each entry of the iteration array is just a pointer to a hash entry, so if we shared only the iterator, its elements would point into the local memory of other workers. I thought of one more option: during tbm_begin_iterate, instead of keeping a pointer to the hash entry, we could make a copy of it in the DSA area, but that would hold two copies of each hash table element for a short duration, which is not good. I think what we are doing currently is better than that. > The work of tbm_begin_iterate() should be done before we > begin the shared scan and the results of that work should be shared. > What happens right now (it appears) is that every backend does that > work based on the same hash table and we just assume they all get the > same answer. Actually, tbm_begin_iterate processes each element of the hash table and converts it into the chunk and page arrays; currently this is done by each worker, and it's pretty cheap. If I try to do this only in the first worker, the implementation will look something like this: 1. First, we need to create tbm->spages and tbm->schunks arrays, both of which should be arrays of dsa_pointers. 2. Then each worker will process these arrays and convert them to arrays of its local pointers. 3. With the current solution, where all hash elements are stored in one large DSA chunk, how are we going to divide them into multiple dsa_pointers? I will work on the remaining comments. > And we really do assume that, because > pbms_parallel_iterate() assumes it can shuttle private state back and > forth between iterator in different backends and nothing will break; > but those backends aren't actually using the same iteration arrays. > They are using different iteration arrays that should have the same > contents because they were all derived from the same semi-shared hash > table. That's pretty fragile, and a waste of CPU cycles if the hash > table is large (every backend does its own sort). > > On a related note, I think it's unacceptable to make the internal > details of TBMIterator public. You've moved it from tidbitmap.c to > tidbitmap.h so that nodeBitmapHeapScan.c can tinker with the guts of > the TBMIterator, but that's not OK. Those details need to remain > private to tidbitmap.c. pbms_parallel_iterate() is a nasty kludge; we > need some better solution. The knowledge of how a shared iterator > should iterate needs to be private to tidbitmap.c, not off in the > executor someplace. And I think the entries need to actually be > updated in shared memory directly, not copied back and forth between a > backend-private iterator and a shared iterator. > > Also, pbms_parallel_iterate() can't hold a spinlock around a call to > tbm_iterate(). Note the coding rules for spinlocks mentioned in > spin.h and src/backend/storage/lmgr/README. I think the right thing > to do here is to use an LWLock in a new tranche (add an entry to > BuiltinTrancheIds). > > In 0002, you have this: > > +tb->alloc = palloc(sizeof(SH_ALLOCATOR)); > > This should be using MemoryContextAlloc(ctx, ...) rather than palloc. 
> Otherwise the allocator object is in a different context from > everything else in the hash table. But TBH, I don't really see why we > want this to be a separate object. Why not just put > HashAlloc/HashFree/args into SH_TYPE directly? That avoids some > pointer chasing and doesn't really seem to cost anything (except that > SH_CREATE will grow a slightly longer argument sequence). > > I think maybe we should rename the functions to element_allocate, > element_free, and element_allocator_ctx, or something like that. The > current names aren't capitalized consistently with other things in > this module, and putting the word "element" in there would make it > more clear what the purpose of this thing is. -- Regards, Dilip Kumar EnterpriseDB: http://www.enterprisedb.com -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
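A stand-alone sketch of the locking change Robert asks for in the quoted review: a spinlock may only protect a few straight-line instructions, so advancing the shared iterator inside tbm_iterate() must happen under a sleepable lock instead (an LWLock in a new tranche in the patch; a pthread mutex stands in for it here, and all names are illustrative):

```c
#include <pthread.h>

typedef struct SharedIterator
{
	pthread_mutex_t lock;		/* stands in for the LWLock tranche */
	int			next;			/* shared iteration position */
	int			nentries;
} SharedIterator;

/* Advance the shared position under the lock; returns -1 when exhausted.
 * Because the lock is sleepable, it is safe to hold it across the whole
 * (potentially non-trivial) iteration step, which a spinlock is not. */
static int
shared_iterate(SharedIterator *it)
{
	int			result = -1;

	pthread_mutex_lock(&it->lock);
	if (it->next < it->nentries)
		result = it->next++;	/* the real code fills a TBMIterateResult */
	pthread_mutex_unlock(&it->lock);
	return result;
}

int
iterate_all(void)
{
	SharedIterator it;
	int			count = 0;

	pthread_mutex_init(&it.lock, NULL);
	it.next = 0;
	it.nentries = 5;
	while (shared_iterate(&it) >= 0)
		count++;
	pthread_mutex_destroy(&it.lock);
	return count;
}
```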
Re: [HACKERS] Parallel bitmap heap scan
On Tue, Jan 31, 2017 at 6:05 AM, Dilip Kumar wrote: >> 0002-hash-support-alloc-free-v14.patch: >> >> >> + if (tb->alloc) >> + { >> >> The memory for tb->alloc is allocated always, is the if check still >> required? > > In parallel case, only first worker will call SH_CREATE, other worker > will only do palloc for pagetable and copy the reference from main > worker, so they will not have allocator. When Andres wrote in https://www.postgresql.org/message-id/20161017201558.cr34stc6zvxy3...@alap3.anarazel.de that you should try to share only the iteration arrays, I believe that he meant the results of tbm_begin_iterate() -- that is, iterator->spageptr, chunkptr, schunkbit, spages, and schunks. I'm perhaps putting words into his mouth, but what he said was that we could avoid sharing "the whole underlying hash". But the patch set you've implemented here doesn't behave that way. Instead, you've got the array containing the hash table elements shared (which is good) plus there's a sort of dummy hash table in every worker which copies some but not all of the members of the original hash table, leading to the otherwise-unnecessary if-test that Haribabu is complaining about. So the hash table is kinda-shared-kinda-not-shared, which I don't *think* is really what Andres had in mind. In short, I think SH_COPY (implemented here as pagetable_copy) needs to go away. The work of tbm_begin_iterate() should be done before we begin the shared scan and the results of that work should be shared. What happens right now (it appears) is that every backend does that work based on the same hash table and we just assume they all get the same answer. And we really do assume that, because pbms_parallel_iterate() assumes it can shuttle private state back and forth between iterator in different backends and nothing will break; but those backends aren't actually using the same iteration arrays. 
They are using different iteration arrays that should have the same contents because they were all derived from the same semi-shared hash table. That's pretty fragile, and a waste of CPU cycles if the hash table is large (every backend does its own sort). On a related note, I think it's unacceptable to make the internal details of TBMIterator public. You've moved it from tidbitmap.c to tidbitmap.h so that nodeBitmapHeapScan.c can tinker with the guts of the TBMIterator, but that's not OK. Those details need to remain private to tidbitmap.c. pbms_parallel_iterate() is a nasty kludge; we need some better solution. The knowledge of how a shared iterator should iterate needs to be private to tidbitmap.c, not off in the executor someplace. And I think the entries need to actually be updated in shared memory directly, not copied back and forth between a backend-private iterator and a shared iterator. Also, pbms_parallel_iterate() can't hold a spinlock around a call to tbm_iterate(). Note the coding rules for spinlocks mentioned in spin.h and src/backend/storage/lmgr/README. I think the right thing to do here is to use an LWLock in a new tranche (add an entry to BuiltinTrancheIds). In 0002, you have this: +tb->alloc = palloc(sizeof(SH_ALLOCATOR)); This should be using MemoryContextAlloc(ctx, ...) rather than palloc. Otherwise the allocator object is in a different context from everything else in the hash table. But TBH, I don't really see why we want this to be a separate object. Why not just put HashAlloc/HashFree/args into SH_TYPE directly? That avoids some pointer chasing and doesn't really seem to cost anything (except that SH_CREATE will grow a slightly longer argument sequence). I think maybe we should rename the functions to element_allocate, element_free, and element_allocator_ctx, or something like that. 
The current names aren't capitalized consistently with other things in this module, and putting the word "element" in there would make it more clear what the purpose of this thing is. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
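Robert's last two points above (allocate in the caller's memory context, and embed the callbacks in SH_TYPE instead of a separately allocated SH_ALLOCATOR) can be sketched stand-alone like this, with malloc/calloc standing in for MemoryContextAlloc and all names illustrative:

```c
#include <stdlib.h>
#include <stddef.h>

typedef void *(*element_allocate_fn) (size_t size, void *args);
typedef void (*element_free_fn) (void *ptr, void *args);

/* Allocator members embedded directly in the table struct, as suggested,
 * instead of a separately allocated object the table points at: one less
 * pointer to chase, and nothing can end up in the wrong context. */
typedef struct demo_hash
{
	size_t		size;
	void	   *data;
	element_allocate_fn element_alloc;
	element_free_fn element_free;
	void	   *element_args;
} demo_hash;

static void *
std_alloc(size_t size, void *args)
{
	(void) args;
	return calloc(1, size);		/* zeroed, like MCXT_ALLOC_ZERO */
}

static void
std_free(void *ptr, void *args)
{
	(void) args;
	free(ptr);
}

/* SH_CREATE grows a slightly longer argument sequence, as the review notes. */
static demo_hash *
demo_create(size_t nelem, element_allocate_fn ea, element_free_fn ef, void *args)
{
	demo_hash  *tb = malloc(sizeof(demo_hash));

	tb->size = nelem;
	tb->element_alloc = ea;
	tb->element_free = ef;
	tb->element_args = args;
	tb->data = tb->element_alloc(nelem * sizeof(int), args);
	return tb;
}

static void
demo_destroy(demo_hash *tb)
{
	tb->element_free(tb->data, tb->element_args);
	free(tb);					/* no separate allocator object to free */
}

int
alloc_embed_demo(void)
{
	demo_hash  *tb = demo_create(8, std_alloc, std_free, NULL);
	int			ok = (tb->data != NULL);

	demo_destroy(tb);
	return ok;
}
```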
Re: [HACKERS] Parallel bitmap heap scan
On Tue, Jan 31, 2017 at 7:47 AM, Haribabu Kommi wrote: > Thanks for the update. I have some comments Thanks for the review. > > 0002-hash-support-alloc-free-v14.patch: > > > + if (tb->alloc) > + { > > The memory for tb->alloc is allocated always, is the if check still > required? In the parallel case, only the first worker will call SH_CREATE; the other workers will only palloc the pagetable and copy the reference from the main worker, so they will not have an allocator. > 0003-parallel-bitmap-heap-scan-v14.patch: > > > + * and set parallel flag in lower level bitmap index scan. Later > + * bitmap index node will use this flag to indicate tidbitmap that > + * it needs to create an shared page table. > + */ > > I feel it is better to mention where this flag is used, so that it will be > clearer. Done > > > + * Mark tidbitmap as shared and also store DSA area in it. > + * marking tidbitmap as shared is indication that this tidbitmap > + * should be created in shared memory. DSA area will be used for > > The flag of shared is set in the tidbitmap structure itself, but the > comment mentions that the tidbitmap should be created in shared memory. > I think it is the page table that needs to be created in shared memory. > Providing some more information here will be helpful. Done > > - node->tbmres = tbmres = tbm_iterate(tbmiterator); > + node->tbmres = tbmres = pbms_parallel_iterate(tbmiterator, > + pbminfo ? &pbminfo->tbmiterator : NULL); > > Instead of passing both the normal and parallel iterators to the function > and checking again for NULL inside, how about just adding the check > in the caller itself? Or if you prefer the current method, then try to > explain the details of the inputs in the function header with more description. Done as you said. > > + /* Increase prefetch target if it's not yet at the max. 
*/ > + if (node->prefetch_target < node->prefetch_maximum) > > I didn't evaluate all scenarios, but the above code may have a problem: > in the parallel case the prefetch_target is fetched from node > and not from pbminfo, but later it is read from pbminfo and updated there. > Maybe this code needs a rewrite. Fixed it. > > + TBMIterator *iterator = node->prefetch_iterator; > > Why another variable? Why can't we use the prefetch_iterator itself? > Currently node->tbmiterator and node->prefetch_iterator are initialized > irrespective of whether it is a parallel one or not. But while using them, there > is a check to use the parallel one in case of parallel. If that is the case, > why can't we avoid the initialization itself? Fixed > > > + else > + needWait = false; > > By default needWait is false. Just set that to true only for the > case of PBM_INPROGRESS. Actually, inside the while loop we first set needWait = true if the state is PBM_INPROGRESS and go for ConditionVariableSleep. After coming out of the sleep, we need to check whether PBM_FINISHED is set or whether we need to sleep again; in the latter case we need to reset needWait to false. This could instead be done by returning directly when it's PBM_FINISHED, but I want to keep the logic below common. + SpinLockRelease(&pbminfo->state_mutex); + + /* If we are leader or leader has already created a TIDBITMAP */ + if (leader || !needWait) + break; > + * so that during free we can directly get the dsa_pointe and free it. > > dsa_pointe -> dsa_pointer Done > > > +typedef struct +{ + TIDBitmap *tbm; /* TIDBitmap we're iterating over */ + int spageptr; /* next spages index */ + int schunkptr; /* next schunks index */ + int schunkbit; /* next bit to check in current schunk */ + TBMIterateResult output; /* MUST BE LAST (because variable-size) */ +} TBMIterator; > > I didn't see the need to move this structure. Can you point me to where it > is used? 
Because pbms_parallel_iterate needs to access this structure, I moved it to tidbitmap.h. -- Regards, Dilip Kumar EnterpriseDB: http://www.enterprisedb.com 0003-parallel-bitmap-heap-scan-v15.patch Description: Binary data -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
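The needWait logic discussed above amounts to a small state machine: the first participant to see PBM_INITIAL becomes the leader and builds the bitmap, and the others wait only while the state is PBM_INPROGRESS, re-checking after each wakeup. A single-threaded sketch — the PBM_* phase names follow the patch, while the locking and sleeping are reduced to comments:

```c
typedef enum
{
	PBM_INITIAL,
	PBM_INPROGRESS,
	PBM_FINISHED
} PBMPhase;

/* Returns 1 if this participant became the leader. */
static int
pbm_attach(PBMPhase *phase)
{
	int			leader = 0;
	int			needWait;

	for (;;)
	{
		needWait = 0;			/* reset on every pass, as discussed above */
		/* SpinLockAcquire(&pbminfo->state_mutex) would protect this */
		if (*phase == PBM_INITIAL)
		{
			*phase = PBM_INPROGRESS;	/* claim leadership */
			leader = 1;
		}
		else if (*phase == PBM_INPROGRESS)
			needWait = 1;		/* leader is still building the bitmap */
		/* SpinLockRelease(&pbminfo->state_mutex) */

		/* If we are leader or leader has already created a TIDBITMAP */
		if (leader || !needWait)
			break;
		/* ConditionVariableSleep() here in the patch; on wakeup the loop
		 * re-reads the phase with needWait reset to 0 */
	}
	return leader;
}

int
pbm_demo(void)
{
	PBMPhase	phase = PBM_INITIAL;
	int			first = pbm_attach(&phase); /* becomes the leader */
	int			second;

	phase = PBM_FINISHED;		/* leader finished building the bitmap */
	second = pbm_attach(&phase);	/* attaches without waiting */
	return first == 1 && second == 0;
}
```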
Re: [HACKERS] Parallel bitmap heap scan
On Tue, Jan 31, 2017 at 11:17 AM, Haribabu Kommi wrote: > On Fri, Jan 27, 2017 at 5:32 PM, Dilip Kumar wrote: >> On Tue, Jan 24, 2017 at 10:18 AM, Dilip Kumar >> wrote: >> > I have changed as per the comments. 0002 and 0003 are changed, 0001 is >> > still the same. >> >> There is just one line change in 0003 compared to older version, all >> other patches are the same. > > Thanks for the update. I have some comments This review is too fresh to be addressed, so I have moved this patch to the next CF. -- Michael -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Parallel bitmap heap scan
On Fri, Jan 27, 2017 at 5:32 PM, Dilip Kumar wrote: > On Tue, Jan 24, 2017 at 10:18 AM, Dilip Kumar > wrote: > > I have changed as per the comments. 0002 and 0003 are changed, 0001 is > > still the same. > > 2 days back my colleague Rafia reported one issue (offlist) that > the parallel bitmap node is not scaling as well as other nodes, e.g. > parallel sequential scan and parallel index scan. > > After some perf analysis, I found that there was one unconditional > spin lock in the parallel bitmap patch which we were taking for checking > the prefetch target. Basically, we were always taking the lock and > checking whether prefetch_target had reached prefetch_maximum. So even > after it reaches prefetch_maximum we will acquire the lock and > check after every tuple. I have changed that logic; now I check > the condition first, and only if we need to increase the target do I take > the lock and recheck the condition. > > There is just a one-line change in 0003 compared to the older version; all > other patches are the same. > > Some performance data to show that the new parallel bitmap patch performs > way better than the previous version. > TPC-H scale 20, work_mem 64MB, shared buffers 8GB (4 workers) > machine: intel 8 socket machine > > v13 (time in ms)  v14 (time in ms) > Q6: 37065.841  18202.903 > Q14: 15229.569  11341.121 Thanks for the update. I have some comments. 0002-hash-support-alloc-free-v14.patch: + if (tb->alloc) + { The memory for tb->alloc is always allocated, so is the if check still required? 0003-parallel-bitmap-heap-scan-v14.patch: + * and set parallel flag in lower level bitmap index scan. Later + * bitmap index node will use this flag to indicate tidbitmap that + * it needs to create an shared page table. + */ I feel it is better to mention where this flag is used, so that it will be clearer. + * Mark tidbitmap as shared and also store DSA area in it. + * marking tidbitmap as shared is indication that this tidbitmap + * should be created in shared memory. 
DSA area will be used for The flag of shared is set in the tidbitmap structure itself, but the comment mentions that the tidbitmap should be created in shared memory. I think it is the page table that needs to be created in shared memory. Providing some more information here will be helpful. - node->tbmres = tbmres = tbm_iterate(tbmiterator); + node->tbmres = tbmres = pbms_parallel_iterate(tbmiterator, + pbminfo ? &pbminfo->tbmiterator : NULL); Instead of passing both the normal and parallel iterators to the function and checking again for NULL inside, how about just adding the check in the caller itself? Or if you prefer the current method, then try to explain the details of the inputs in the function header with more description. + /* Increase prefetch target if it's not yet at the max. */ + if (node->prefetch_target < node->prefetch_maximum) I didn't evaluate all scenarios, but the above code may have a problem: in the parallel case the prefetch_target is fetched from node and not from pbminfo, but later it is read from pbminfo and updated there. Maybe this code needs a rewrite. + TBMIterator *iterator = node->prefetch_iterator; Why another variable? Why can't we use the prefetch_iterator itself? Currently node->tbmiterator and node->prefetch_iterator are initialized irrespective of whether it is a parallel one or not. But while using them, there is a check to use the parallel one in case of parallel. If that is the case, why can't we avoid the initialization itself? + else + needWait = false; By default needWait is false. Just set that to true only for the case of PBM_INPROGRESS. + * so that during free we can directly get the dsa_pointe and free it. 
dsa_pointe -> dsa_pointer +typedef struct +{ + TIDBitmap *tbm; /* TIDBitmap we're iterating over */ + int spageptr; /* next spages index */ + int schunkptr; /* next schunks index */ + int schunkbit; /* next bit to check in current schunk */ + TBMIterateResult output; /* MUST BE LAST (because variable-size) */ +} TBMIterator; I didn't see the need to move this structure. Can you point me to where it is used? Regards, Hari Babu Fujitsu Australia
Re: [HACKERS] Parallel bitmap heap scan
On Fri, Jan 27, 2017 at 1:32 AM, Dilip Kumar wrote: > There is just one line change in 0003 compared to older version, all > other patches are the same. I spent some time looking at 0001 (and how those changes are used in 0003) and I thought it looked good, so I committed 0001. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Parallel bitmap heap scan
On Tue, Jan 24, 2017 at 10:18 AM, Dilip Kumar wrote: > I have changed as per the comments. 0002 and 0003 are changed, 0001 is > still the same. 2 days back my colleague Rafia reported one issue (offlist): the parallel bitmap node is not scaling as well as other nodes, e.g. parallel sequential scan and parallel index scan. After some perf analysis, I found that there was one unconditional spin lock in the parallel bitmap patch which we were taking for checking the prefetch target. Basically, we were always taking the lock and checking whether prefetch_target had reached prefetch_maximum. So even after it reaches prefetch_maximum we will acquire the lock and check after every tuple. I have changed that logic; now I check the condition first, and only if we need to increase the target do I take the lock and recheck the condition. There is just a one-line change in 0003 compared to the older version; all other patches are the same. Some performance data to show that the new parallel bitmap patch performs way better than the previous version. TPC-H scale 20, work_mem 64MB, shared buffers 8GB (4 workers), on an intel 8 socket machine:

       v13 (time in ms)   v14 (time in ms)
Q6     37065.841          18202.903
Q14    15229.569          11341.121

-- Regards, Dilip Kumar EnterpriseDB: http://www.enterprisedb.com 0001-opt-parallelcost-refactoring-v14.patch Description: Binary data 0002-hash-support-alloc-free-v14.patch Description: Binary data 0003-parallel-bitmap-heap-scan-v14.patch Description: Binary data -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
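The check-then-lock-then-recheck logic described above is a double-checked pattern: since prefetch_target only ever grows toward prefetch_maximum, a stale unlocked read can at worst cause one redundant lock acquisition, never a missed update. A stand-alone sketch with illustrative names and the spinlock reduced to comments:

```c
typedef struct SharedPrefetchState
{
	int			prefetch_target;
	int			prefetch_maximum;
	/* slock_t mutex; protects prefetch_target in the real patch */
} SharedPrefetchState;

static void
maybe_grow_prefetch_target(SharedPrefetchState *s)
{
	/* Unlocked fast path: once the target has reached the maximum it
	 * never shrinks, so after that point we skip the lock entirely
	 * instead of acquiring it for every tuple. */
	if (s->prefetch_target >= s->prefetch_maximum)
		return;

	/* SpinLockAcquire(&s->mutex); */
	if (s->prefetch_target < s->prefetch_maximum)	/* recheck under lock */
		s->prefetch_target++;
	/* SpinLockRelease(&s->mutex); */
}

int
prefetch_demo(void)
{
	SharedPrefetchState s = {0, 4};
	int			i;

	/* Many per-tuple calls; only the first few ever take the "lock". */
	for (i = 0; i < 10; i++)
		maybe_grow_prefetch_target(&s);
	return s.prefetch_target;	/* capped at prefetch_maximum */
}
```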
Re: [HACKERS] Parallel bitmap heap scan
On Mon, Jan 23, 2017 at 1:52 PM, Haribabu Kommi wrote: > I reviewed 0002-hash-support-alloc-free-v12.patch, some minor comments. > > - SH_TYPE*tb; > - uint64 size; > + SH_TYPE *tb; > + uint64 size; > > The above change may not be required. > > + if (tb->alloc) > + { > + tb->alloc->HashFree(tb->data, tb->alloc->args); > + pfree(tb->alloc); > + } > > The above code tries to free the tb->alloc memory. In case if the user > has provide the alloc structure to SH_CREATE function and the same > pointer is stored in the tb structure. And in free function freeing that > memory may cause problem. > > So either explicitly mentioning that the input must a palloc'ed data or > by default allocate memory and copy the input data into allocated > memory. I have changed as per the comments. 0002 and 0003 are changed, 0001 is still the same. -- Regards, Dilip Kumar EnterpriseDB: http://www.enterprisedb.com 0001-opt-parallelcost-refactoring-v13.patch Description: Binary data 0002-hash-support-alloc-free-v13.patch Description: Binary data 0003-parallel-bitmap-heap-scan-v13.patch Description: Binary data -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Parallel bitmap heap scan
On Mon, Jan 23, 2017 at 3:42 PM, Dilip Kumar wrote: > On Thu, Jan 19, 2017 at 12:26 AM, Robert Haas > wrote: > >> Patch 0001 and 0003 required to rebase on the latest head. 0002 is > >> still the same. > > > > I've committed the first half of 0001. > Thanks. 0001 and 0003 required rebasing after this commit. I reviewed 0002-hash-support-alloc-free-v12.patch; some minor comments. - SH_TYPE*tb; - uint64 size; + SH_TYPE *tb; + uint64 size; The above change may not be required. + if (tb->alloc) + { + tb->alloc->HashFree(tb->data, tb->alloc->args); + pfree(tb->alloc); + } The above code tries to free the tb->alloc memory. If the user has provided the alloc structure to the SH_CREATE function, the same pointer is stored in the tb structure, and freeing that memory in the free function may cause a problem. So either explicitly mention that the input must be palloc'ed data, or by default allocate memory and copy the input data into the allocated memory. Regards, Hari Babu Fujitsu Australia
Re: [HACKERS] Parallel bitmap heap scan
On Thu, Jan 19, 2017 at 12:26 AM, Robert Haas wrote: >> Patch 0001 and 0003 required to rebase on the latest head. 0002 is >> still the same. > > I've committed the first half of 0001. Thanks. 0001 and 0003 required rebasing after this commit. -- Regards, Dilip Kumar EnterpriseDB: http://www.enterprisedb.com 0001-opt-parallelcost-refactoring-v12.patch Description: Binary data 0002-hash-support-alloc-free-v12.patch Description: Binary data 0003-parallel-bitmap-heap-scan-v12.patch Description: Binary data -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Parallel bitmap heap scan
On Wed, Jan 18, 2017 at 12:14 AM, Dilip Kumar wrote: > On Fri, Jan 13, 2017 at 6:36 PM, Dilip Kumar wrote: >> >> Please verify with the new patch. > > Patch 0001 and 0003 required to rebase on the latest head. 0002 is > still the same. I've committed the first half of 0001. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Re: [HACKERS] Parallel bitmap heap scan
On Fri, Jan 13, 2017 at 6:36 PM, Dilip Kumar wrote: > > Please verify with the new patch. Patches 0001 and 0003 required a rebase on the latest head; 0002 is still the same. -- Regards, Dilip Kumar EnterpriseDB: http://www.enterprisedb.com 0001-opt-parallelcost-refactoring-v11.patch Description: Binary data 0002-hash-support-alloc-free-v11.patch Description: Binary data 0003-parallel-bitmap-heap-scan-v11.patch Description: Binary data
Re: [HACKERS] Parallel bitmap heap scan
On Thu, Jan 12, 2017 at 5:47 PM, Dilip Kumar wrote: >> Hello Dilip, >> I was trying to test the performance of all the parallel query related >> patches on TPC-H queries, I found that when parallel hash [1 ]and your >> patch are applied then Q10 and Q14 were hanged, however, without any >> one of these patches, these queries are running fine. Following is the >> stack trace, > > Thanks for reporting, I will look into this. With a TPC-H scale factor 5 setup, I could reproduce the hang you reported with Q10. I have fixed the issue in v10 (only 0003 is changed; the other patches are unchanged). However, after fixing this issue it crashes with the parallel shared hash patch, and the call stack is the same as the one you reported on the parallel shared hash thread[1] [1] https://www.postgresql.org/message-id/CAOGQiiOJji9GbLwfT0vBuixdkgAsxzZjTpPxsO6BiTLD--SzYg%40mail.gmail.com Please verify with the new patch. -- Regards, Dilip Kumar EnterpriseDB: http://www.enterprisedb.com 0003-parallel-bitmap-heap-scan-v10.patch Description: Binary data
Re: [HACKERS] Parallel bitmap heap scan
On Thu, Jan 12, 2017 at 5:22 PM, Rafia Sabih wrote: > Hello Dilip, > I was trying to test the performance of all the parallel query related > patches on TPC-H queries, I found that when parallel hash [1 ]and your > patch are applied then Q10 and Q14 were hanged, however, without any > one of these patches, these queries are running fine. Following is the > stack trace, Thanks for reporting, I will look into this. -- Regards, Dilip Kumar EnterpriseDB: http://www.enterprisedb.com
Re: [HACKERS] Parallel bitmap heap scan
On Wed, Jan 11, 2017 at 5:33 PM, tushar wrote: > > On 01/10/2017 05:16 PM, Dilip Kumar wrote: >> >> Please try attached patch and confirm from your >> side. > > Thanks,issue seems to be fixed now. > > -- > regards,tushar > EnterpriseDB https://www.enterprisedb.com/ > The Enterprise PostgreSQL Company Hello Dilip, I was trying to test the performance of all the parallel query related patches on TPC-H queries. I found that when parallel hash [1] and your patch are applied, Q10 and Q14 hang; however, without either of these patches, these queries run fine. Following is the stack trace. At the master: #0 0x3fff8bef7de8 in __epoll_wait_nocancel () from /lib64/power8/libc.so.6 #1 0x104ea184 in WaitEventSetWaitBlock (set=0x10039b2d360, cur_timeout=-1, occurred_events=0x3fffd4a55a98, nevents=1) at latch.c:998 #2 0x104e9fc8 in WaitEventSetWait (set=0x10039b2d360, timeout=-1, occurred_events=0x3fffd4a55a98, nevents=1, wait_event_info=134217730) at latch.c:950 #3 0x104e9484 in WaitLatchOrSocket (latch=0x3fff85128ab4, wakeEvents=1, sock=-1, timeout=-1, wait_event_info=134217730) at latch.c:350 #4 0x104e931c in WaitLatch (latch=0x3fff85128ab4, wakeEvents=1, timeout=0, wait_event_info=134217730) at latch.c:304 #5 0x1032a378 in gather_readnext (gatherstate=0x10039aeeb98) at nodeGather.c:393 #6 0x1032a044 in gather_getnext (gatherstate=0x10039aeeb98) at nodeGather.c:293 #7 0x10329e94 in ExecGather (node=0x10039aeeb98) at nodeGather.c:234 #8 0x103086f0 in ExecProcNode (node=0x10039aeeb98) at execProcnode.c:521 #9 0x1031f550 in fetch_input_tuple (aggstate=0x10039aed220) at nodeAgg.c:587 #10 0x103229a4 in agg_retrieve_direct (aggstate=0x10039aed220) at nodeAgg.c:2211 At one of the worker processes: #0 0x103879f8 in pagetable_insert (tb=0x10039abcac0, key=6491, found=0x3fffd4a54b88 "") at
../../../src/include/lib/simplehash.h:571 #1 0x10389964 in tbm_get_pageentry (tbm=0x10039ab5778, pageno=6491) at tidbitmap.c:876 #2 0x10388608 in tbm_add_tuples (tbm=0x10039ab5778, tids=0x10039a94c30, ntids=1, recheck=0 '\000') at tidbitmap.c:340 #3 0x100f5e54 in btgetbitmap (scan=0x10039ab3c80, tbm=0x10039ab5778) at nbtree.c:439 #4 0x100eab7c in index_getbitmap (scan=0x10039ab3c80, bitmap=0x10039ab5778) at indexam.c:687 #5 0x10328acc in MultiExecBitmapIndexScan (node=0x10039a98630) at nodeBitmapIndexscan.c:98 #6 0x103088d8 in MultiExecProcNode (node=0x10039a98630) at execProcnode.c:591 #7 0x10326c70 in BitmapHeapNext (node=0x10039a98770) at nodeBitmapHeapscan.c:164 #8 0x10316440 in ExecScanFetch (node=0x10039a98770, accessMtd=0x10326b70 , recheckMtd=0x10327bb4 ) at execScan.c:95 #9 0x10316590 in ExecScan (node=0x10039a98770, accessMtd=0x10326b70 , recheckMtd=0x10327bb4 ) at execScan.c:180 #10 0x10327c84 in ExecBitmapHeapScan (node=0x10039a98770) at nodeBitmapHeapscan.c:623 At other workers, #0 0x3fff8bef7de8 in __epoll_wait_nocancel () from /lib64/power8/libc.so.6 #1 0x104ea184 in WaitEventSetWaitBlock (set=0x10039a24670, cur_timeout=-1, occurred_events=0x3fffd4a55ed8, nevents=1) at latch.c:998 #2 0x104e9fc8 in WaitEventSetWait (set=0x10039a24670, timeout=-1, occurred_events=0x3fffd4a55ed8, nevents=1, wait_event_info=134217737) at latch.c:950 #3 0x1051a3dc in ConditionVariableSleep (cv=0x3fff7c2d01ac, wait_event_info=134217737) at condition_variable.c:132 #4 0x103283e0 in pbms_is_leader (pbminfo=0x3fff7c2d0178) at nodeBitmapHeapscan.c:900 #5 0x10326c38 in BitmapHeapNext (node=0x10039a95f50) at nodeBitmapHeapscan.c:153 #6 0x10316440 in ExecScanFetch (node=0x10039a95f50, accessMtd=0x10326b70 , recheckMtd=0x10327bb4 ) at execScan.c:95 #7 0x10316590 in ExecScan (node=0x10039a95f50, accessMtd=0x10326b70 , recheckMtd=0x10327bb4 ) at execScan.c:180 #8 0x10327c84 in ExecBitmapHeapScan (node=0x10039a95f50) at nodeBitmapHeapscan.c:623 #9 0x10308588 in ExecProcNode 
(node=0x10039a95f50) at execProcnode.c:443 #10 0x103310d8 in ExecHashJoinOuterGetTuple (outerNode=0x10039a95f50, hjstate=0x10039a96808, hashvalue=0x3fffd4a5639c) at nodeHashjoin.c:936 I was using TPC-H with scale factor 20; please let me know if there is anything more you require in this regard. [1] https://www.postgresql.org/message-id/CAEepm%3D1vGcv6LBrxZeqPb_rPxfraidWAF_8_4z2ZMQ%2B7DOjj9w%40mail.gmail.com -- Regards, Rafia Sabih EnterpriseDB: http://www.enterprisedb.com/
Re: [HACKERS] Parallel bitmap heap scan
On 01/10/2017 05:16 PM, Dilip Kumar wrote: Please try attached patch and confirm from your side. Thanks, the issue seems to be fixed now. -- regards, tushar EnterpriseDB https://www.enterprisedb.com/ The Enterprise PostgreSQL Company
Re: [HACKERS] Parallel bitmap heap scan
On Tue, Jan 10, 2017 at 2:19 PM, tushar wrote: > We found a regression , earlier the testcase was working fine (against the > older patches of Parallel bitmap heap scan) but now getting a server crash > against v8 patches. > > Testcase - (one of the table of TPC-H ) > > postgres=#explain analyze verbose > SELECT SUM(l_extendedprice) FROM lineitem > WHERE (l_shipdate >= '1995-01-01'::date) > AND (l_shipdate <='1996-03-31'::date); While fixing some of the review comments in v7 and v8, I had allocated new memory for the pagetable and missed initializing it. Thanks for reporting; I have fixed it in v9. After the fix, the query runs fine for me. Please try the attached patch and confirm from your side. postgres=# explain analyze verbose SELECT SUM(l_extendedprice) FROM lineitem WHERE (l_shipdate >= '1995-01-01'::date) AND (l_shipdate <='1996-03-31'::date); QUERY PLAN ----------------- Finalize Aggregate (cost=798002.46..798002.47 rows=1 width=32) (actual time=15501.245..15501.245 rows=1 loops=1) Output: sum(l_extendedprice) -> Gather (cost=798002.24..798002.45 rows=2 width=32) (actual time=15494.358..15498.919 rows=3 loops=1) Output: (PARTIAL sum(l_extendedprice)) Workers Planned: 2 Workers Launched: 2 -> Partial Aggregate (cost=797002.24..797002.25 rows=1 width=32) (actual time=15492.937..15492.937 rows=1 loops=3) Output: PARTIAL sum(l_extendedprice) Worker 0: actual time=15491.218..15491.219 rows=1 loops=1 Worker 1: actual time=15493.514..15493.514 rows=1 loops=1 -> Parallel Bitmap Heap Scan on public.lineitem (cost=147461.75..791014.31 rows=2395170 width=8) (actual time=8553.301..15061.333 rows=1892947 loops=3) Output: l_extendedprice Recheck Cond: ((lineitem.l_shipdate >= '1995-01-01'::date) AND (lineitem.l_shipdate <= '1996-03-31'::date)) Rows Removed by Index Recheck: 6451177 Heap Blocks: exact=27963 lossy=164938 Worker 0: actual time=8548.957..15054.511 rows=1887239 loops=1 Worker 1: actual time=8554.817..15050.317 rows=1902477 loops=1 -> Bitmap Index Scan on idx_lineitem_shipdate
(cost=0.00..146024.65 rows=5748409 width=0) (actual time=8533.701..8533.701 rows=5678841 loops=1) Index Cond: ((lineitem.l_shipdate >= '1995-01-01'::date) AND (lineitem.l_shipdate <= '1996-03-31'::date)) Planning time: 2.742 ms Execution time: 15509.696 ms (21 rows) -- Regards, Dilip Kumar EnterpriseDB: http://www.enterprisedb.com 0001-opt-parallelcost-refactoring-v9.patch Description: Binary data 0002-hash-support-alloc-free-v9.patch Description: Binary data 0003-parallel-bitmap-heap-scan-v9.patch Description: Binary data
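The root cause fixed in v9 — freshly allocated but uninitialized pagetable memory — is a classic crash class. A hedged stand-alone sketch of the defensive pattern (plain C stand-ins invented for illustration; PostgreSQL code would use palloc0() rather than calloc()):

```c
#include <assert.h>
#include <stdlib.h>

/* Simplified, hypothetical stand-in for a simplehash page-table entry. */
typedef struct PagetableEntry
{
    unsigned long blockno;
    int         status;         /* 0 must mean "empty slot" */
} PagetableEntry;

/*
 * Allocate the entry array zeroed, so every slot starts out in the
 * "empty" state.  With a plain malloc()/palloc(), the status field
 * would contain garbage, and a lookup walking the table could follow
 * bogus entries and crash -- the failure mode reported above.
 */
PagetableEntry *
pagetable_alloc_zeroed(size_t nentries)
{
    return calloc(nentries, sizeof(PagetableEntry));
}
```

The same effect can be had by an explicit memset() immediately after allocation; the important invariant is that no entry is ever visible in an indeterminate state.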
Re: [HACKERS] Parallel bitmap heap scan
On 01/10/2017 11:29 AM, tushar wrote: On 01/09/2017 07:22 PM, Dilip Kumar wrote: Thanks, Tushar. I have fixed it. The defect was in 0002. I have also observed another issue related to code refactoring, Actually, there was some code present in 0001 which supposed to be in 0003. Thanks, I have checked at my end and it is fixed now. We found a regression; earlier the test case was working fine (against the older patches of parallel bitmap heap scan), but now we get a server crash against the v8 patches. Test case (one of the tables of TPC-H): postgres=# explain analyze verbose SELECT SUM(l_extendedprice) FROM lineitem WHERE (l_shipdate >= '1995-01-01'::date) AND (l_shipdate <='1996-03-31'::date); WARNING: terminating connection because of crash of another server process DETAIL: The postmaster has commanded this server process to roll back the current transaction and exit, because another server process exited abnormally and possibly corrupted shared memory. HINT: In a moment you should be able to reconnect to the database and repeat your command. server closed the connection unexpectedly This probably means the server terminated abnormally before or while processing the request. connection to server was lost Here is the stack trace (two core dump files were generated): [centos@centos-cpula bin]$ gdb -q -c data/core.25434 /home/centos/PG10_10jan/postgresql/edbpsql/bin/postgres Reading symbols from /home/centos/PG10_10jan/postgresql/edbpsql/bin/postgres...done. [New Thread 25434] Missing separate debuginfo for Try: yum --enablerepo='*-debug*' install /usr/lib/debug/.build-id/7f/719af91ee951b4fcb6647e7868f95f766a616b Reading symbols from /usr/lib64/libssl.so.10...(no debugging symbols found)...done. Loaded symbols for /usr/lib64/libssl.so.10 Reading symbols from /usr/lib64/libcrypto.so.10...(no debugging symbols found)...done. Loaded symbols for /usr/lib64/libcrypto.so.10 Reading symbols from /lib64/librt.so.1...(no debugging symbols found)...done.
Loaded symbols for /lib64/librt.so.1 Reading symbols from /lib64/libdl.so.2...(no debugging symbols found)...done. Loaded symbols for /lib64/libdl.so.2 Reading symbols from /lib64/libm.so.6...(no debugging symbols found)...done. Loaded symbols for /lib64/libm.so.6 Reading symbols from /lib64/libc.so.6...(no debugging symbols found)...done. Loaded symbols for /lib64/libc.so.6 Reading symbols from /lib64/libpthread.so.0...(no debugging symbols found)...done. [Thread debugging using libthread_db enabled] Loaded symbols for /lib64/libpthread.so.0 Reading symbols from /lib64/libgssapi_krb5.so.2...(no debugging symbols found)...done. Loaded symbols for /lib64/libgssapi_krb5.so.2 Reading symbols from /lib64/libkrb5.so.3...(no debugging symbols found)...done. Loaded symbols for /lib64/libkrb5.so.3 Reading symbols from /lib64/libcom_err.so.2...(no debugging symbols found)...done. Loaded symbols for /lib64/libcom_err.so.2 Reading symbols from /lib64/libk5crypto.so.3...(no debugging symbols found)...done. Loaded symbols for /lib64/libk5crypto.so.3 Reading symbols from /lib64/libz.so.1...(no debugging symbols found)...done. Loaded symbols for /lib64/libz.so.1 Reading symbols from /lib64/ld-linux-x86-64.so.2...(no debugging symbols found)...done. Loaded symbols for /lib64/ld-linux-x86-64.so.2 Reading symbols from /lib64/libkrb5support.so.0...(no debugging symbols found)...done. Loaded symbols for /lib64/libkrb5support.so.0 Reading symbols from /lib64/libkeyutils.so.1...(no debugging symbols found)...done. Loaded symbols for /lib64/libkeyutils.so.1 Reading symbols from /lib64/libresolv.so.2...(no debugging symbols found)...done. Loaded symbols for /lib64/libresolv.so.2 Reading symbols from /lib64/libselinux.so.1...(no debugging symbols found)...done. Loaded symbols for /lib64/libselinux.so.1 Reading symbols from /lib64/libnss_files.so.2...(no debugging symbols found)...done. 
Loaded symbols for /lib64/libnss_files.so.2 Core was generated by `postgres: bgworker: parallel worker for PID 25433 '. Program terminated with signal 11, Segmentation fault. #0 0x006f2fa6 in pagetable_destroy (tb=0x2079bf0) at ../../../src/include/lib/simplehash.h:361 361tb->alloc->HashFree(tb->data, tb->alloc->args); Missing separate debuginfos, use: debuginfo-install glibc-2.12-1.192.el6.x86_64 keyutils-libs-1.4-5.el6.x86_64 krb5-libs-1.10.3-57.el6.x86_64 libcom_err-1.41.12-22.el6.x86_64 libselinux-2.0.94-7.el6.x86_64 openssl-1.0.1e-48.el6_8.1.x86_64 zlib-1.2.3-29.el6.x86_64 (gdb) bt #0 0x006f2fa6 in pagetable_destroy (tb=0x2079bf0) at ../../../src/include/lib/simplehash.h:361 #1 0x006f3b52 in tbm_free (tbm=0x2077fe0) at tidbitmap.c:296 #2 0x006ab29b in ExecEndBitmapHeapScan (node=0x207e760) at nodeBitmapHeapscan.c:717 #3 0x00691701 in ExecEndNode (node=0x207e760) at execProcnode.c:689 #4 0x006a8f86 in ExecEndAgg (node=0x207e878) at nodeAgg.c:3563 #5 0x000
Re: [HACKERS] Parallel bitmap heap scan
On 01/09/2017 07:22 PM, Dilip Kumar wrote: Thanks, Tushar. I have fixed it. The defect was in 0002. I have also observed another issue related to code refactoring, Actually, there was some code present in 0001 which supposed to be in 0003. Thanks, I have checked at my end and it is fixed now. -- regards,tushar EnterpriseDB https://www.enterprisedb.com/ The Enterprise PostgreSQL Company
Re: [HACKERS] Parallel bitmap heap scan
On Mon, Jan 9, 2017 at 5:01 PM, tushar wrote: > Thanks Dilip. issue is reproducible if we uses '--enable-cassert' switch > in the configure. We are able to reproduce it only with --enable-cassert' . Thanks, Tushar. I have fixed it; the defect was in 0002. I have also observed another issue related to code refactoring: some code present in 0001 was supposed to be in 0003. -- Regards, Dilip Kumar EnterpriseDB: http://www.enterprisedb.com 0001-opt-parallelcost-refactoring-v8.patch Description: Binary data 0002-hash-support-alloc-free-v8.patch Description: Binary data 0003-parallel-bitmap-heap-scan-v8.patch Description: Binary data
Re: [HACKERS] Parallel bitmap heap scan
On 01/09/2017 04:36 PM, Dilip Kumar wrote: I have taken the latest code, applied all 3 patches and compiled. Initdb is working fine for me. Can you please verify, do you have any extra patch along with my patch? Did you properly clean the code? Thanks, Dilip. The issue is reproducible if we use the '--enable-cassert' switch in configure; we are able to reproduce it only with '--enable-cassert'. -- regards, tushar EnterpriseDB https://www.enterprisedb.com/ The Enterprise PostgreSQL Company
Re: [HACKERS] Parallel bitmap heap scan
On Mon, Jan 9, 2017 at 3:07 PM, tushar wrote: > creating directory data ... ok > creating subdirectories ... ok > selecting default max_connections ... 100 > selecting default shared_buffers ... 128MB > selecting dynamic shared memory implementation ... posix > creating configuration files ... ok > running bootstrap script ... ok > performing post-bootstrap initialization ... sh: line 1: 30709 Segmentation > fault "/home/centos/PG10_9ja/postgresql/edbpsql/bin/postgres" --single -F -O > -j -c search_path=pg_catalog -c exit_on_error=true template1 > /dev/null > child process exited with exit code 139 > initdb: removing data directory "data" I have taken the latest code, applied all 3 patches, and compiled; initdb works fine for me. Can you please verify: do you have any extra patch applied along with mine? Did you properly clean the code? I have asked one of my colleagues to verify this, and the result is the same: no crash. -- Regards, Dilip Kumar EnterpriseDB: http://www.enterprisedb.com
Re: [HACKERS] Parallel bitmap heap scan
On 01/09/2017 01:05 PM, Dilip Kumar wrote: This patch can be used by 0003-parallel-bitmap-heap-scan-v7.patch attached in the mail and also parallel-index-scan[2] can be rebased on this, if this get committed, After applying your patches against fresh sources of PG v10, I am not able to perform initdb: [centos@tusharcentos7 bin]$ ./initdb -D data The files belonging to this database system will be owned by user "centos". This user must also own the server process. The database cluster will be initialized with locale "en_US.utf8". The default database encoding has accordingly been set to "UTF8". The default text search configuration will be set to "english". Data page checksums are disabled. creating directory data ... ok creating subdirectories ... ok selecting default max_connections ... 100 selecting default shared_buffers ... 128MB selecting dynamic shared memory implementation ... posix creating configuration files ... ok running bootstrap script ... ok performing post-bootstrap initialization ... sh: line 1: 30709 Segmentation fault "/home/centos/PG10_9ja/postgresql/edbpsql/bin/postgres" --single -F -O -j -c search_path=pg_catalog -c exit_on_error=true template1 > /dev/null child process exited with exit code 139 initdb: removing data directory "data" [centos@tusharcentos7 bin]$ -- regards, tushar EnterpriseDB https://www.enterprisedb.com/ The Enterprise PostgreSQL Company
Re: [HACKERS] Parallel bitmap heap scan
On Fri, Jan 6, 2017 at 10:47 AM, Amit Khandekar wrote: > This looks good now. Thanks.. > > > Further points below Thanks for the review. > = nodeBItmapHeapscan.c == > > > In BitmapHeapNext(), the following code is relevant only for tbm > returned from MultiExecProcNode(). > if (!tbm || !IsA(tbm, TIDBitmap)) > elog(ERROR, "unrecognized result from subplan"); Fixed > -- > > BitmapHeapScanState->is_leader field looks unnecessary. Instead, a > local variable is_leader in BitmapHeapNext() will solve the purpose. > This is because is_leader is used only in BitmapHeapNext(). Fixed > > -- > > In BitmapHeapNext(), just before tbm_restore_shared_members() is > called, we create tbm using tbm_create(). I think tbm_create() does > not make sense for shared tbm. Whatever fields are required, will be > restored in tbm_restore_shared_members(). The other fields which do > not make sense in a restored tbm are not required to be initialized > using tbm_create(). So I think tbm_restore_shared_members() itself can > call makeNode(TIDBitmap). (Also it is not required to initialize > tbm->allocator; see note below in tidbitmap.c section). So > tbm_restore_shared_members() itself can call tbm_set_parallel(). > Looking at all this, it looks better to have the function name changed > to tbm_attach(parallel_tbm) or tbm_restore(parallel_tbm) rather than > tbm_restore_shared_members(). The function header anyways (rightly) > says : Attach worker to shared TID bitmap. Fixed > > - > > From what I understand, the leader worker does not have to create its > iterators before waking up the other workers, as long as it makes sure > it copies tbm fields into shared memory before waking workers. But in > the patch, tbm_begin_iterate() is called *before* the > ConditionVariableBroadcast() is called. So I feel, we can shift the > code inside the "if (node->is_leader)" to a place inside the "if > (pbminfo == NULL || pbms_is_leader(pbminfo))" condition block, just > after MultiExecProcNode() call. 
(And we won't even need is_leader > local variable as well). This way now the other workers will start > working sooner. Correct, Fixed. > > > > == tidbitmap.c === > > > The new #include's are not in alphabetical order. Done.. > > -- > > ParallelTIDBitmap.inited is unused, and I believe, not required. Fixed > > -- > > For leader worker, the new TIDBitmap fields added for parallel bitmap > *are* valid while the tid is being built. So the below comment should > be shifted accordingly : > /* these are valid when iterating is true: */ > Better still, the shared tbm-related fields can be kept in the end, > and a comment should be added that these are for shared tbm. > Done > -- > > It seems, the comment below the last ParallelTIDBitmap field is not relevant : > /* table to dsa_pointer's array. */ Fixed.. > > -- > > tbm->allocator field does not seem to be required. A new allocator can > be just palloc'ed in tbm_create_pagetable(), and passed to > pagetable_create(). SH_CREATE() stores this allocator in the > SH_TYPE->alloc field, and fetches the same whenever it needs it for > calling any of the allocator functions. So we can remove the > tbm->allocator field and shift "palloc(sizeof(pagetable_alloc))" call > from tbm_create() to tbm_create_pagetable(). Done > > -- > > In tbm_free() : > I think we should call pagetable_destroy() even when tbm is shared, so > that the hash implementation gets a chance to free the hash table > structure. I understand that the hash table structure itself is not > big, so it will only be a small memory leak, but it's better to not > assume that. Instead, let SH_DESTROY() call HashFree(). Then, in > tbm_free_shared(), we can skip the dsa_free() call if tbm->iterating > is false. Basically, tbm bitmap implementation should deduce from the > bitmap state whether it should free the shared data, rather than > preventing a call to SH_DESTROY(). 
Fixed > > --- > > In tbm_begin_iterate(), for shared tbm, internal structures from > simplehash.h are assumed to be known. For e.g., the hash entries will > always be present in one single array, and also the entry status is > evaluated using pagetable_IN_USE. Is simplehash.h designed keeping in > mind that these things are suppose to be exposed ? > > I understand that the hash table handle is local to the leader worker, > and so it is not accessible to other workers. And so, we cannot use > pagetable_iterate() to scan the hash table. So, how about copying the > SH_TYPE structure and making it accessible to other workers ? If we > have something like SH_ATTACH() or SH_COPY(), this will copy the > relevant fields that are sufficient to restore the SH_TYPE structure, > and other workers can start using this hash table after assigning dsa > array back to tb->data. Something like HASH_ATTACH used in dynahash.c. This looks cleaner, and also avoid processin
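The SH_COPY/SH_ATTACH idea adopted in this exchange amounts to copying the leader's table header while re-pointing the bucket array at the shared allocation as seen by each worker. A rough stand-alone sketch — hypothetical names, and plain pointers where the real patch would translate DSA pointers:

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/*
 * Hypothetical hash-table header.  "data" normally points into
 * process-local memory; for a shared table, each worker must recompute
 * it from its own mapping of the shared segment, because the leader's
 * raw pointer is meaningless in another process's address space.
 */
typedef struct HashTableHdr
{
    uint64_t    size;           /* number of buckets */
    uint64_t    members;        /* number of entries */
    void       *data;           /* bucket array */
} HashTableHdr;

/*
 * Attach: copy the leader's header fields (size, member count, etc.),
 * then point "data" at this worker's mapping of the shared bucket
 * array -- analogous to HASH_ATTACH in dynahash.c.
 */
void
hash_attach(HashTableHdr *local, const HashTableHdr *leader_hdr,
            void *shared_data_mapping)
{
    memcpy(local, leader_hdr, sizeof(HashTableHdr));
    local->data = shared_data_mapping;
}
```

This lets a worker iterate the table with the ordinary iteration code instead of reaching into the hash implementation's internals, which is the cleanliness argument made above.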
Re: [HACKERS] Parallel bitmap heap scan
Sorry for the delay in my next response. I still haven't exhaustively gone through all changes, but meanwhile, below are some more points. On 26 November 2016 at 18:18, Dilip Kumar wrote: > On Tue, Nov 22, 2016 at 9:05 AM, Dilip Kumar wrote: >> On Fri, Nov 18, 2016 at 9:59 AM, Amit Khandekar >> wrote: >> >> Thanks for the review.. > > I have worked on these comments.. >> >>> In pbms_is_leader() , I didn't clearly understand the significance of >>> the for-loop. If it is a worker, it can call >>> ConditionVariablePrepareToSleep() followed by >>> ConditionVariableSleep(). Once it comes out of >>> ConditionVariableSleep(), isn't it guaranteed that the leader has >>> finished the bitmap ? If yes, then it looks like it is not necessary >>> to again iterate and go back through the pbminfo->state checking. >>> Also, with this, variable queuedSelf also might not be needed. But I >>> might be missing something here. >> >> I have taken this logic from example posted on conditional variable thread[1] >> >> for (;;) >> { >> ConditionVariablePrepareToSleep(cv); >> if (condition for which we are waiting is satisfied) >> break; >> ConditionVariableSleep(); >> } >> ConditionVariableCancelSleep(); >> >> [1] >> https://www.postgresql.org/message-id/CA%2BTgmoaj2aPti0yho7FeEf2qt-JgQPRWb0gci_o1Hfr%3DC56Xng%40mail.gmail.com With the latest patches, this looks fine to me. >>> tbm_iterate() call under SpinLock : >>> For parallel tbm iteration, tbm_iterate() is called while SpinLock is >>> held. Generally we try to keep code inside Spinlock call limited to a >>> few lines, and that too without occurrence of a function call. >>> Although tbm_iterate() code itself looks safe under a spinlock, I was >>> checking if we can squeeze SpinlockAcquire() and SpinLockRelease() >>> closer to each other. One thought is : in tbm_iterate(), acquire the >>> SpinLock before the while loop that iterates over lossy chunks. 
Then, >>> if both chunk and per-page data remain, release spinlock just before >>> returning (the first return stmt). And then just before scanning >>> bitmap of an exact page, i.e. just after "if (iterator->spageptr < >>> tbm->npages)", save the page handle, increment iterator->spageptr, >>> release Spinlock, and then use the saved page handle to iterate over >>> the page bitmap. >> >> Main reason to keep Spin lock out of this function to avoid changes >> inside this function, and also this function takes local iterator as >> input which don't have spin lock reference to it. But that can be >> changed, we can pass shared iterator to it. >> >> I will think about this logic and try to update in next version. > Still this issue is not addressed. > Logic inside tbm_iterate is using same variable, like spageptr, > multiple places. IMHO this complete logic needs to be done under one > spin lock. I think we both agree on the part that the mutex handle can be passed to tbm_iterate() to do this. I am yet to give more thought on how clumsy it will be if we add SpinlockRelease() calls in intermediate places in the function rather than in the end. > >> >>> >>> prefetch_pages() call under Spinlock : >>> Here again, prefetch_pages() is called while pbminfo->prefetch_mutex >>> Spinlock is held. Effectively, heavy functions like PrefetchBuffer() >>> would get called while under the Spinlock. These can even ereport(). >>> One option is to use mutex lock for this purpose. But I think that >>> would slow things down. Moreover, the complete set of prefetch pages >>> would be scanned by a single worker, and others might wait for this >>> one. Instead, what I am thinking is: grab the pbminfo->prefetch_mutex >>> Spinlock only while incrementing pbminfo->prefetch_pages. The rest >>> part viz : iterating over the prefetch pages, and doing the >>> PrefetchBuffer() need not be synchronised using this >>> pgbinfo->prefetch_mutex Spinlock. 
pbms_parallel_iterate() already has >>> its own iterator spinlock. Only thing is, workers may not do the >>> actual PrefetchBuffer() sequentially. One of them might shoot ahead >>> and prefetch 3-4 pages while the other is lagging with the >>> sequentially lesser page number; but I believe this is fine, as long >>> as they all prefetch all the required blocks. >> >> I agree with your point, will try to fix this as well. > > I have fixed this part. This looks good now. Further points below = nodeBItmapHeapscan.c == In BitmapHeapNext(), the following code is relevant only for tbm returned from MultiExecProcNode(). if (!tbm || !IsA(tbm, TIDBitmap)) elog(ERROR, "unrecognized result from subplan"); So it should be moved just below MultiExecProcNode(), so that it does not get called when it is created from tbm_create(). -- BitmapHeapScanState->is_leader field looks unnecessary. Instead, a local variable is_leader in BitmapHeapNext() will solve the purpose. This is because is_leader is used only in BitmapHeapNext(). -- In BitmapHeapNext(), just befo
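The prepare/check/sleep loop quoted earlier in this review exists to guard against waking before the condition actually holds. The same discipline in portable POSIX form — an analogy using pthreads, not PostgreSQL's ConditionVariable API:

```c
#include <assert.h>
#include <pthread.h>
#include <stdbool.h>

typedef struct WaitState
{
    pthread_mutex_t lock;
    pthread_cond_t  cv;
    bool            done;       /* the condition being waited on */
} WaitState;

/*
 * Re-check the condition after every wakeup: condition-variable sleeps
 * can end spuriously, which is exactly why the quoted PostgreSQL loop
 * re-tests "condition for which we are waiting" before breaking out.
 */
void
wait_until_done(WaitState *ws)
{
    pthread_mutex_lock(&ws->lock);
    while (!ws->done)
        pthread_cond_wait(&ws->cv, &ws->lock);
    pthread_mutex_unlock(&ws->lock);
}

void
signal_done(WaitState *ws)
{
    pthread_mutex_lock(&ws->lock);
    ws->done = true;
    pthread_cond_broadcast(&ws->cv);
    pthread_mutex_unlock(&ws->lock);
}
```

The structural difference from pthreads is that PostgreSQL's ConditionVariablePrepareToSleep() registers the waiter before the condition check, so a signal arriving between the check and the sleep is not lost; the re-check loop is the part both APIs share.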
Re: [HACKERS] Parallel bitmap heap scan
On Mon, Dec 26, 2016 at 3:14 PM, Dilip Kumar wrote: > Other the another option is, that we can always make caller to provide > an allocator. But this way every new user for simple hash need to take > care of having allocator. > > What is your opinion? Attached is the new version of the patch, which implements it the way I described. > > >>This also needs docs, including a warning that just >> using an allocator in shared memory does *NOT* allow the hash table to be >> used in shared memory in the general case. > > Make sense. Added the Warning. I have also fixed a bug in parallel bitmap heap scan (path.parallel_workers was not initialized before calling cost_bitmap_heap_scan in some cases, so it picked up an uninitialized value). Patch attached. -- Regards, Dilip Kumar EnterpriseDB: http://www.enterprisedb.com hash-support-alloc-free-v6.patch Description: Binary data parallel-bitmap-heap-scan-v6.patch Description: Binary data