Re: [PATCH] Resolve Parallel Hash Join Performance Issue
On Tue, Jan 21, 2020 at 6:20 PM Thomas Munro wrote: > On Fri, Jan 10, 2020 at 1:52 PM Deng, Gang wrote: > > Thank you for the comment. Yes, I agree the alternative of using > > '(!parallel)', so that no need to test the bit. Will someone submit patch > > to for it accordingly? > > Here's a patch like that. Pushed. Thanks again for the report! I didn't try the TPC-DS query, but could see a small improvement from this on various simple queries, especially with a fairly small hash table and a large outer relation, when many cores are probing. (Off topic for this thread, but after burning a few hours on a 72-way box investigating various things including this, I was reminded of the performance drop-off for joins with large hash tables that happens somewhere around 8-16 workers. That's because we can't give 32KB chunks out fast enough, and if you increase the chunk size it helps only a bit. That really needs some work; maybe something like a separation of reservation and allocation, so that multiple segments can be created in parallel while respecting limits, or something like that. The other thing I was reminded of: FreeBSD blows Linux out of the water on big parallel hash joins on identical hardware; I didn't dig further today but I suspect this may be down to lack of huge pages (TLB misses), and perhaps also those pesky fallocate() calls. I'm starting to wonder if we should have a new GUC shared_work_mem that reserves a wodge of shm in the main region, and hand out 'fast DSM segments' from there, or some other higher level abstraction that's wired into the resource release system; they would benefit from huge_pages=try on Linux, they'd be entirely allocated (in the VM sense) and there'd be no system calls, though admittedly there'd be more ways for things to go wrong...)
Re: [PATCH] Resolve Parallel Hash Join Performance Issue
(Replies to both Gang and Tom below). On Fri, Jan 10, 2020 at 1:52 PM Deng, Gang wrote: > Thank you for the comment. Yes, I agree the alternative of using > '(!parallel)', so that no need to test the bit. Will someone submit patch to > for it accordingly? Here's a patch like that. On Fri, Jan 10, 2020 at 3:43 AM Tom Lane wrote: > Thomas Munro writes: > > Right, I see. The funny thing is that the match bit is not even used > > in this query (it's used for right and full hash join, and those > > aren't supported for parallel joins yet). Hmm. So, instead of the > > test you proposed, an alternative would be to use if (!parallel). > > That's a value that will be constant-folded, so that there will be no > > branch in the generated code (see the pg_attribute_always_inline > > trick). If, in a future release, we need the match bit for parallel > > hash join because we add parallel right/full hash join support, we > > could do it the way you showed, but only if it's one of those join > > types, using another constant parameter. > > Can we base the test off the match type today, and avoid leaving > something that will need to be fixed later? I agree that it'd be nicer to use the logically correct thing, namely a test of HJ_FILL_INNER(node), but that'd be a run-time check. I'd like to back-patch this and figured that we don't want to add new branches too casually. I have an experimental patch where "fill_inner" and "fill_outer" are compile-time constants and you can skip various bits of code without branching (part of a larger experiment to figure out which of many parameters are worth specialising at a cost of a couple of KB of text per combination, including the ability to use wider hashes so that monster sized joins work better). Then I could test the logically correct thing explicitly without branches. 0001-Avoid-unnecessary-shmem-writes-in-Parallel-Hash-Join.patch Description: Binary data
RE: [PATCH] Resolve Parallel Hash Join Performance Issue
Regarding to the reason of setting bit was not cheap anymore in parallel join. As I explain in my original mail, it is because 'false sharing cache coherence'. In short word, setting of the bit will cause the whole cache line (64 bytes) dirty. So that all CPU cores contain the cache line have to load it again, which will waste much cpu time. Article https://software.intel.com/en-us/articles/avoiding-and-identifying-false-sharing-among-threads explain more detail. -Original Message- From: Tom Lane Sent: Thursday, January 9, 2020 10:43 PM To: Thomas Munro Cc: Deng, Gang ; pgsql-hack...@postgresql.org Subject: Re: [PATCH] Resolve Parallel Hash Join Performance Issue Thomas Munro writes: > Right, I see. The funny thing is that the match bit is not even used > in this query (it's used for right and full hash join, and those > aren't supported for parallel joins yet). Hmm. So, instead of the > test you proposed, an alternative would be to use if (!parallel). > That's a value that will be constant-folded, so that there will be no > branch in the generated code (see the pg_attribute_always_inline > trick). If, in a future release, we need the match bit for parallel > hash join because we add parallel right/full hash join support, we > could do it the way you showed, but only if it's one of those join > types, using another constant parameter. Can we base the test off the match type today, and avoid leaving something that will need to be fixed later? I'm pretty sure that the existing coding is my fault, and that it's like that because I reasoned that setting the bit was too cheap to justify having a test-and-branch around it. Apparently that's not true anymore in a parallel join, but I have to say that it's unclear why. In any case, the reasoning probably still holds good in non-parallel cases, so it'd be a shame to introduce a run-time test if we can avoid it. regards, tom lane
RE: [PATCH] Resolve Parallel Hash Join Performance Issue
Thank you for the comment. Yes, I agree the alternative of using '(!parallel)', so that no need to test the bit. Will someone submit patch to for it accordingly? -Original Message- From: Thomas Munro Sent: Thursday, January 9, 2020 6:04 PM To: Deng, Gang Cc: pgsql-hack...@postgresql.org Subject: Re: [PATCH] Resolve Parallel Hash Join Performance Issue On Thu, Jan 9, 2020 at 10:04 PM Deng, Gang wrote: > Attached is a patch to resolve parallel hash join performance issue. This is > my first time to contribute patch to PostgreSQL community, I referred one of > previous thread as template to report the issue and patch. Please let me know > if need more information of the problem and patch. Thank you very much for investigating this and for your report. > HeapTupleHeaderSetMatch(HJTUPLE_MINTUPLE(node->hj_CurTuple)); > > changed to: > > if > (!HeapTupleHeaderHasMatch(HJTUPLE_MINTUPLE(node->hj_CurTuple))) > > { > > > HeapTupleHeaderSetMatch(HJTUPLE_MINTUPLE(node->hj_CurTuple)); > > } > > Compared with original code, modified code can avoid unnecessary write to > memory/cache. Right, I see. The funny thing is that the match bit is not even used in this query (it's used for right and full hash join, and those aren't supported for parallel joins yet). Hmm. So, instead of the test you proposed, an alternative would be to use if (!parallel). That's a value that will be constant-folded, so that there will be no branch in the generated code (see the pg_attribute_always_inline trick). If, in a future release, we need the match bit for parallel hash join because we add parallel right/full hash join support, we could do it the way you showed, but only if it's one of those join types, using another constant parameter. > D. Result > > With the modified code, performance of hash join operation can scale better > with number of threads. Here is result of query02 after patch. For example, > performance improved ~2.5x when run 28 threads. > > number of thread:1 48 1628 > time used(sec):465.1 193.1 97.9 55.9 41 Wow. That is a very nice improvement.
Re: [PATCH] Resolve Parallel Hash Join Performance Issue
Thomas Munro writes: > Right, I see. The funny thing is that the match bit is not even used > in this query (it's used for right and full hash join, and those > aren't supported for parallel joins yet). Hmm. So, instead of the > test you proposed, an alternative would be to use if (!parallel). > That's a value that will be constant-folded, so that there will be no > branch in the generated code (see the pg_attribute_always_inline > trick). If, in a future release, we need the match bit for parallel > hash join because we add parallel right/full hash join support, we > could do it the way you showed, but only if it's one of those join > types, using another constant parameter. Can we base the test off the match type today, and avoid leaving something that will need to be fixed later? I'm pretty sure that the existing coding is my fault, and that it's like that because I reasoned that setting the bit was too cheap to justify having a test-and-branch around it. Apparently that's not true anymore in a parallel join, but I have to say that it's unclear why. In any case, the reasoning probably still holds good in non-parallel cases, so it'd be a shame to introduce a run-time test if we can avoid it. regards, tom lane
Re: [PATCH] Resolve Parallel Hash Join Performance Issue
On Thu, Jan 9, 2020 at 10:04 PM Deng, Gang wrote: > Attached is a patch to resolve parallel hash join performance issue. This is > my first time to contribute patch to PostgreSQL community, I referred one of > previous thread as template to report the issue and patch. Please let me know > if need more information of the problem and patch. Thank you very much for investigating this and for your report. > HeapTupleHeaderSetMatch(HJTUPLE_MINTUPLE(node->hj_CurTuple)); > > changed to: > > if (!HeapTupleHeaderHasMatch(HJTUPLE_MINTUPLE(node->hj_CurTuple))) > > { > > HeapTupleHeaderSetMatch(HJTUPLE_MINTUPLE(node->hj_CurTuple)); > > } > > Compared with original code, modified code can avoid unnecessary write to > memory/cache. Right, I see. The funny thing is that the match bit is not even used in this query (it's used for right and full hash join, and those aren't supported for parallel joins yet). Hmm. So, instead of the test you proposed, an alternative would be to use if (!parallel). That's a value that will be constant-folded, so that there will be no branch in the generated code (see the pg_attribute_always_inline trick). If, in a future release, we need the match bit for parallel hash join because we add parallel right/full hash join support, we could do it the way you showed, but only if it's one of those join types, using another constant parameter. > D. Result > > With the modified code, performance of hash join operation can scale better > with number of threads. Here is result of query02 after patch. For example, > performance improved ~2.5x when run 28 threads. > > number of thread:1 48 1628 > time used(sec):465.1 193.1 97.9 55.9 41 Wow. That is a very nice improvement.