On Tue, Jan 21, 2020 at 6:20 PM Thomas Munro <thomas.mu...@gmail.com> wrote:
> On Fri, Jan 10, 2020 at 1:52 PM Deng, Gang <gang.d...@intel.com> wrote:
> > Thank you for the comment. Yes, I agree with the alternative of using 
> > '(!parallel)', so that there is no need to test the bit. Will someone 
> > submit a patch for it accordingly?
>
> Here's a patch like that.

Pushed.  Thanks again for the report!

I didn't try the TPC-DS query, but could see a small improvement from
this on various simple queries, especially with a fairly small hash
table and a large outer relation, when many cores are probing.
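
To spell out what the '(!parallel)' trick buys us, for anyone skimming
the archives: if the flag is passed into an inlined helper as a
compile-time constant, the compiler specializes the helper and the
per-tuple test of a flag stored in the hash table disappears from the
serial path.  Here's a toy standalone sketch of that pattern (invented
names, not the committed code, which uses the executor's existing
inline-function-with-constant-argument style):

#include <stdbool.h>
#include <stddef.h>
#include <stdio.h>

typedef struct HashTable
{
    bool    parallel;       /* unused in this sketch; the alternative
                             * style would re-test this on every probe */
    size_t  shared_probes;
    size_t  local_probes;
} HashTable;

/*
 * "parallel" is a compile-time constant at both call sites below, so the
 * compiler emits two specializations and this branch disappears; the
 * serial path never has to look at any flag at all.
 */
static inline void
probe_impl(HashTable *ht, bool parallel)
{
    if (parallel)
        ht->shared_probes++;    /* stand-in for the shared-memory probe path */
    else
        ht->local_probes++;     /* stand-in for the private probe path */
}

void
probe_serial(HashTable *ht)
{
    probe_impl(ht, false);      /* the "(!parallel)" case: test folded away */
}

void
probe_parallel(HashTable *ht)
{
    probe_impl(ht, true);
}

int
main(void)
{
    HashTable   ht = {0};

    for (int i = 0; i < 1000000; i++)
        probe_serial(&ht);
    printf("local probes: %zu\n", ht.local_probes);
    return 0;
}

(gcc and clang fold the branch in probe_serial() away at any
optimization level that inlines probe_impl(); the point is just that
the flag is a constant rather than data read from the hash table for
every tuple.)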

(Off topic for this thread, but after burning a few hours on a 72-way
box investigating various things including this, I was reminded of the
performance drop-off for joins with large hash tables that happens
somewhere around 8-16 workers.  That's because we can't hand out 32KB
chunks fast enough, and increasing the chunk size helps only a bit.
That really needs some work; maybe a separation of reservation and
allocation, so that multiple segments can be created in parallel while
still respecting the memory limits.  The other thing I was reminded
of: FreeBSD blows Linux out of the water on big parallel hash joins on
identical hardware; I didn't dig further today, but I suspect this may
be down to the lack of huge pages (TLB misses), and perhaps also those
pesky fallocate() calls.  I'm starting to wonder if we should have a
new GUC shared_work_mem that reserves a wodge of shm in the main
region and hands out 'fast DSM segments' from there, or some other
higher-level abstraction that's wired into the resource release
system; those segments would benefit from huge_pages=try on Linux,
they'd be entirely allocated (in the VM sense), and there'd be no
system calls, though admittedly there'd be more ways for things to go
wrong...)
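
To make the reservation-vs-allocation idea slightly more concrete,
here's a very rough sketch, with made-up names, C11 atomics standing
in for our own atomics and malloc() standing in for segment creation:
the only shared step is a lock-free reservation against a common
budget, and the expensive part runs with no lock held, so many workers
can build segments at the same time without blowing past the limit.

#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdio.h>
#include <stdlib.h>

typedef struct MemBudget
{
    _Atomic size_t reserved;    /* bytes promised to workers so far */
    size_t         limit;       /* hard cap shared by all workers */
} MemBudget;

/* Step 1: cheap, lock-free reservation against the shared limit. */
static bool
reserve_bytes(MemBudget *b, size_t nbytes)
{
    size_t      old = atomic_load(&b->reserved);

    do
    {
        if (old + nbytes > b->limit)
            return false;       /* over budget; caller must wait or spill */
    } while (!atomic_compare_exchange_weak(&b->reserved, &old, old + nbytes));

    return true;
}

/* Step 2: the slow part runs with no lock held, so it can proceed in parallel. */
static void *
create_segment(MemBudget *b, size_t nbytes)
{
    if (!reserve_bytes(b, nbytes))
        return NULL;

    /* stand-in for dsm_create()/mmap(); concurrent callers don't serialize here */
    return malloc(nbytes);
}

int
main(void)
{
    MemBudget   budget = {.reserved = 0, .limit = 1024 * 1024};
    void       *seg1 = create_segment(&budget, 512 * 1024);
    void       *seg2 = create_segment(&budget, 768 * 1024);    /* over the cap */

    printf("first: %s, second: %s\n",
           seg1 ? "reserved and created" : "rejected",
           seg2 ? "reserved and created" : "rejected");
    free(seg1);
    free(seg2);
    return 0;
}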

