Re: [PATCH] Resolve Parallel Hash Join Performance Issue

2020-01-26 Thread Thomas Munro
On Tue, Jan 21, 2020 at 6:20 PM Thomas Munro  wrote:
> On Fri, Jan 10, 2020 at 1:52 PM Deng, Gang  wrote:
> > Thank you for the comment. Yes, I agree the alternative of using 
> > '(!parallel)', so that no need to test the bit. Will someone submit patch 
> > to for it accordingly?
>
> Here's a patch like that.

Pushed.  Thanks again for the report!

I didn't try the TPC-DS query, but could see a small improvement from
this on various simple queries, especially with a fairly small hash
table and a large outer relation, when many cores are probing.

(Off topic for this thread, but after burning a few hours on a 72-way
box investigating various things including this, I was reminded of the
performance drop-off for joins with large hash tables that happens
somewhere around 8-16 workers.  That's because we can't give 32KB
chunks out fast enough, and if you increase the chunk size it helps
only a bit.  That really needs some work; maybe something like a
separation of reservation and allocation, so that multiple segments
can be created in parallel while respecting limits, or something like
that.  The other thing I was reminded of: FreeBSD blows Linux out of
the water on big parallel hash joins on identical hardware; I didn't
dig further today but I suspect this may be down to lack of huge pages
(TLB misses), and perhaps also those pesky fallocate() calls.  I'm
starting to wonder if we should have a new GUC shared_work_mem that
reserves a wodge of shm in the main region, and hand out 'fast DSM
segments' from there, or some other higher level abstraction that's
wired into the resource release system; they would benefit from
huge_pages=try on Linux, they'd be entirely allocated (in the VM
sense) and there'd be no system calls, though admittedly there'd be
more ways for things to go wrong...)




Re: [PATCH] Resolve Parallel Hash Join Performance Issue

2020-01-20 Thread Thomas Munro
(Replies to both Gang and Tom below).

On Fri, Jan 10, 2020 at 1:52 PM Deng, Gang  wrote:
> Thank you for the comment. Yes, I agree the alternative of using 
> '(!parallel)', so that no need to test the bit. Will someone submit patch to 
> for it accordingly?

Here's a patch like that.

On Fri, Jan 10, 2020 at 3:43 AM Tom Lane  wrote:
> Thomas Munro  writes:
> > Right, I see.  The funny thing is that the match bit is not even used
> > in this query (it's used for right and full hash join, and those
> > aren't supported for parallel joins yet).  Hmm.  So, instead of the
> > test you proposed, an alternative would be to use if (!parallel).
> > That's a value that will be constant-folded, so that there will be no
> > branch in the generated code (see the pg_attribute_always_inline
> > trick).  If, in a future release, we need the match bit for parallel
> > hash join because we add parallel right/full hash join support, we
> > could do it the way you showed, but only if it's one of those join
> > types, using another constant parameter.
>
> Can we base the test off the match type today, and avoid leaving
> something that will need to be fixed later?

I agree that it'd be nicer to use the logically correct thing, namely
a test of HJ_FILL_INNER(node), but that'd be a run-time check.  I'd
like to back-patch this and figured that we don't want to add new
branches too casually.

I have an experimental patch where "fill_inner" and "fill_outer" are
compile-time constants and you can skip various bits of code without
branching (part of a larger experiment to figure out which of many
parameters are worth specialising at a cost of a couple of KB of text
per combination, including the ability to use wider hashes so that
monster sized joins work better).  Then I could test the logically
correct thing explicitly without branches.


0001-Avoid-unnecessary-shmem-writes-in-Parallel-Hash-Join.patch
Description: Binary data


RE: [PATCH] Resolve Parallel Hash Join Performance Issue

2020-01-09 Thread Deng, Gang
Regarding to the reason of setting bit was not cheap anymore in parallel join. 
As I explain in my original mail, it is because 'false sharing cache 
coherence'. In short word, setting of the bit will cause the whole cache line 
(64 bytes) dirty. So that all CPU cores contain the cache line have to load it 
again, which will waste much cpu time. Article 
https://software.intel.com/en-us/articles/avoiding-and-identifying-false-sharing-among-threads
 explain more detail.

-Original Message-
From: Tom Lane  
Sent: Thursday, January 9, 2020 10:43 PM
To: Thomas Munro 
Cc: Deng, Gang ; pgsql-hack...@postgresql.org
Subject: Re: [PATCH] Resolve Parallel Hash Join Performance Issue

Thomas Munro  writes:
> Right, I see.  The funny thing is that the match bit is not even used 
> in this query (it's used for right and full hash join, and those 
> aren't supported for parallel joins yet).  Hmm.  So, instead of the 
> test you proposed, an alternative would be to use if (!parallel).
> That's a value that will be constant-folded, so that there will be no 
> branch in the generated code (see the pg_attribute_always_inline 
> trick).  If, in a future release, we need the match bit for parallel 
> hash join because we add parallel right/full hash join support, we 
> could do it the way you showed, but only if it's one of those join 
> types, using another constant parameter.

Can we base the test off the match type today, and avoid leaving something that 
will need to be fixed later?

I'm pretty sure that the existing coding is my fault, and that it's like that 
because I reasoned that setting the bit was too cheap to justify having a 
test-and-branch around it.  Apparently that's not true anymore in a parallel 
join, but I have to say that it's unclear why.  In any case, the reasoning 
probably still holds good in non-parallel cases, so it'd be a shame to 
introduce a run-time test if we can avoid it.

regards, tom lane




RE: [PATCH] Resolve Parallel Hash Join Performance Issue

2020-01-09 Thread Deng, Gang
Thank you for the comment. Yes, I agree the alternative of using '(!parallel)', 
so that no need to test the bit. Will someone submit patch to for it 
accordingly?

-Original Message-
From: Thomas Munro  
Sent: Thursday, January 9, 2020 6:04 PM
To: Deng, Gang 
Cc: pgsql-hack...@postgresql.org
Subject: Re: [PATCH] Resolve Parallel Hash Join Performance Issue

On Thu, Jan 9, 2020 at 10:04 PM Deng, Gang  wrote:
> Attached is a patch to resolve parallel hash join performance issue. This is 
> my first time to contribute patch to PostgreSQL community, I referred one of 
> previous thread as template to report the issue and patch. Please let me know 
> if need more information of the problem and patch.

Thank you very much for investigating this and for your report.

> HeapTupleHeaderSetMatch(HJTUPLE_MINTUPLE(node->hj_CurTuple));
>
> changed to:
>
> if 
> (!HeapTupleHeaderHasMatch(HJTUPLE_MINTUPLE(node->hj_CurTuple)))
>
> {
>
> 
> HeapTupleHeaderSetMatch(HJTUPLE_MINTUPLE(node->hj_CurTuple));
>
> }
>
> Compared with original code, modified code can avoid unnecessary write to 
> memory/cache.

Right, I see.  The funny thing is that the match bit is not even used in this 
query (it's used for right and full hash join, and those aren't supported for 
parallel joins yet).  Hmm.  So, instead of the test you proposed, an 
alternative would be to use if (!parallel).
That's a value that will be constant-folded, so that there will be no branch in 
the generated code (see the pg_attribute_always_inline trick).  If, in a future 
release, we need the match bit for parallel hash join because we add parallel 
right/full hash join support, we could do it the way you showed, but only if 
it's one of those join types, using another constant parameter.

> D. Result
>
> With the modified code, performance of hash join operation can scale better 
> with number of threads. Here is result of query02 after patch. For example, 
> performance improved ~2.5x when run 28 threads.
>
> number of thread:1   48 1628
> time used(sec):465.1  193.1   97.9   55.9  41

Wow.  That is a very nice improvement.


Re: [PATCH] Resolve Parallel Hash Join Performance Issue

2020-01-09 Thread Tom Lane
Thomas Munro  writes:
> Right, I see.  The funny thing is that the match bit is not even used
> in this query (it's used for right and full hash join, and those
> aren't supported for parallel joins yet).  Hmm.  So, instead of the
> test you proposed, an alternative would be to use if (!parallel).
> That's a value that will be constant-folded, so that there will be no
> branch in the generated code (see the pg_attribute_always_inline
> trick).  If, in a future release, we need the match bit for parallel
> hash join because we add parallel right/full hash join support, we
> could do it the way you showed, but only if it's one of those join
> types, using another constant parameter.

Can we base the test off the match type today, and avoid leaving
something that will need to be fixed later?

I'm pretty sure that the existing coding is my fault, and that it's
like that because I reasoned that setting the bit was too cheap
to justify having a test-and-branch around it.  Apparently that's
not true anymore in a parallel join, but I have to say that it's
unclear why.  In any case, the reasoning probably still holds good
in non-parallel cases, so it'd be a shame to introduce a run-time
test if we can avoid it.

regards, tom lane




Re: [PATCH] Resolve Parallel Hash Join Performance Issue

2020-01-09 Thread Thomas Munro
On Thu, Jan 9, 2020 at 10:04 PM Deng, Gang  wrote:
> Attached is a patch to resolve parallel hash join performance issue. This is 
> my first time to contribute patch to PostgreSQL community, I referred one of 
> previous thread as template to report the issue and patch. Please let me know 
> if need more information of the problem and patch.

Thank you very much for investigating this and for your report.

> HeapTupleHeaderSetMatch(HJTUPLE_MINTUPLE(node->hj_CurTuple));
>
> changed to:
>
> if (!HeapTupleHeaderHasMatch(HJTUPLE_MINTUPLE(node->hj_CurTuple)))
>
> {
>
> HeapTupleHeaderSetMatch(HJTUPLE_MINTUPLE(node->hj_CurTuple));
>
> }
>
> Compared with original code, modified code can avoid unnecessary write to 
> memory/cache.

Right, I see.  The funny thing is that the match bit is not even used
in this query (it's used for right and full hash join, and those
aren't supported for parallel joins yet).  Hmm.  So, instead of the
test you proposed, an alternative would be to use if (!parallel).
That's a value that will be constant-folded, so that there will be no
branch in the generated code (see the pg_attribute_always_inline
trick).  If, in a future release, we need the match bit for parallel
hash join because we add parallel right/full hash join support, we
could do it the way you showed, but only if it's one of those join
types, using another constant parameter.

> D. Result
>
> With the modified code, performance of hash join operation can scale better 
> with number of threads. Here is result of query02 after patch. For example, 
> performance improved ~2.5x when run 28 threads.
>
> number of thread:1   48 1628
> time used(sec):465.1  193.1   97.9   55.9  41

Wow.  That is a very nice improvement.