Re: [HACKERS] WIP: [[Parallel] Shared] Hash

2017-06-27 Thread Thomas Munro
On Mon, May 22, 2017 at 6:39 AM, Thomas Munro
 wrote:
> On Thu, Apr 27, 2017 at 11:03 AM, Thomas Munro
>  wrote:
>> On Thu, Apr 27, 2017 at 5:13 AM, Oleg Golovanov  wrote:
>>> Can you actualize your patch set? The error got from
>>> 0010-hj-parallel-v12.patch.
>>
>> I really should get around to setting up a cron job to tell me about
>> that.  Here's a rebased version.
>
> Rebased.

Rebased for the recent re-indent and shm_toc API change; no functional
changes in this version.

(I have a new patch set in the pipeline adding the skew optimisation
and some other things, more on that soon.)

-- 
Thomas Munro
http://www.enterprisedb.com


parallel-shared-hash-v15.patchset.tgz
Description: GNU Zip compressed data

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] WIP: [[Parallel] Shared] Hash

2017-05-21 Thread Thomas Munro
On Thu, Apr 27, 2017 at 11:03 AM, Thomas Munro
 wrote:
> On Thu, Apr 27, 2017 at 5:13 AM, Oleg Golovanov  wrote:
>> Can you actualize your patch set? The error got from
>> 0010-hj-parallel-v12.patch.
>
> I really should get around to setting up a cron job to tell me about
> that.  Here's a rebased version.

Rebased.

-- 
Thomas Munro
http://www.enterprisedb.com


parallel-shared-hash-v14.tgz
Description: GNU Zip compressed data

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] WIP: [[Parallel] Shared] Hash

2017-04-26 Thread Thomas Munro
On Thu, Apr 27, 2017 at 5:13 AM, Oleg Golovanov  wrote:
> Can you actualize your patch set? The error got from
> 0010-hj-parallel-v12.patch.

I really should get around to setting up a cron job to tell me about
that.  Here's a rebased version.

The things currently on my list for this patch are:

1.  Implement the skew optimisation.
2.  Consider Andres's suggestion of splitting MultiExecHash into two
functions, serial and parallel version, rather than having all those
conditional blocks in there.
3.  Figure out whether the shared BufFile stuff I propose would work
well for Peter Geoghegan's parallel tuple sort patch, by trying it
(I've made a start, more soon).
4.  Figure out how the costing model needs to be tweaked, probably
based on experimentation.

I'm taking a short break to work on other things right now but will
post a version with those changes soon.

-- 
Thomas Munro
http://www.enterprisedb.com


parallel-shared-hash-v13.tgz
Description: GNU Zip compressed data

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


[HACKERS] Re: [HACKERS] WIP: [[Parallel] Shared] Hash

2017-04-26 Thread Oleg Golovanov
Hi.

Thanks for rebased patch set v12. Currently I try to use this patch on my new 
test site and get following:

Hmm...  The next patch looks like a unified diff to me...
The text leading up to this was:
--
|diff --git a/src/include/access/parallel.h b/src/include/access/parallel.h
|index bdf15621c83..e9db8880161 100644
|--- a/src/include/access/parallel.h
|+++ b/src/include/access/parallel.h
--
patching file src/include/access/parallel.h
Using Plan A...
Hunk #1 FAILED at 58.
1 out of 1 hunk FAILED -- saving rejects to file 
src/include/access/parallel.h.rej

Can you actualize your patch set? The error got from 0010-hj-parallel-v12.patch.

Best Regards,

Oleg Golovanov
Moscow, Russia

>Четверг, 13 апреля 2017, 13:49 +03:00 от Thomas Munro 
>:
>
>On Thu, Apr 13, 2017 at 10:04 PM, Oleg Golovanov < rent...@mail.ru > wrote:
>> bash-4.2$ grep Hunk *.log | grep FAILED
>> 0005-hj-leader-gate-v11.patch.log:Hunk #1 FAILED at 14.
>> 0010-hj-parallel-v11.patch.log:Hunk #2 FAILED at 2850.
>> 0010-hj-parallel-v11.patch.log:Hunk #1 FAILED at 21.
>> 0010-hj-parallel-v11.patch.log:Hunk #3 FAILED at 622.
>> 0010-hj-parallel-v11.patch.log:Hunk #6 FAILED at 687.
>> 0010-hj-parallel-v11.patch.log:Hunk #1 FAILED at 21.
>> 0010-hj-parallel-v11.patch.log:Hunk #3 FAILED at 153.
>
>Hi Oleg
>
>Thanks for looking at this.  It conflicted with commit 9c7f5229.  Here
>is a rebased patch set.
>
>This version also removes some code for dealing with transient record
>types which didn't work out.  I'm trying to deal with that problem
>separately[1] and in a general way so that the parallel hash join
>patch doesn't have to deal with it at all.
>
>[1]  
>https://www.postgresql.org/message-id/CAEepm=0ZtQ-SpsgCyzzYpsXS6e=kzwqk3g5ygn3mdv7a8da...@mail.gmail.com
>
>-- 
>Thomas Munro
>http://www.enterprisedb.com
>
>-- 
>Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
>To make changes to your subscription:
>http://www.postgresql.org/mailpref/pgsql-hackers



Re: [HACKERS] WIP: [[Parallel] Shared] Hash

2017-04-13 Thread Thomas Munro
On Thu, Apr 13, 2017 at 10:04 PM, Oleg Golovanov  wrote:
> bash-4.2$ grep Hunk *.log | grep FAILED
> 0005-hj-leader-gate-v11.patch.log:Hunk #1 FAILED at 14.
> 0010-hj-parallel-v11.patch.log:Hunk #2 FAILED at 2850.
> 0010-hj-parallel-v11.patch.log:Hunk #1 FAILED at 21.
> 0010-hj-parallel-v11.patch.log:Hunk #3 FAILED at 622.
> 0010-hj-parallel-v11.patch.log:Hunk #6 FAILED at 687.
> 0010-hj-parallel-v11.patch.log:Hunk #1 FAILED at 21.
> 0010-hj-parallel-v11.patch.log:Hunk #3 FAILED at 153.

Hi Oleg

Thanks for looking at this.  It conflicted with commit 9c7f5229.  Here
is a rebased patch set.

This version also removes some code for dealing with transient record
types which didn't work out.  I'm trying to deal with that problem
separately[1] and in a general way so that the parallel hash join
patch doesn't have to deal with it at all.

[1] 
https://www.postgresql.org/message-id/CAEepm=0ZtQ-SpsgCyzzYpsXS6e=kzwqk3g5ygn3mdv7a8da...@mail.gmail.com

-- 
Thomas Munro
http://www.enterprisedb.com


parallel-shared-hash-v12.tgz
Description: GNU Zip compressed data

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


[HACKERS] Re: [HACKERS] WIP: [[Parallel] Shared] Hash

2017-04-13 Thread Oleg Golovanov
Hi.

I got errors of patching on CentOS 7:

bash-4.2$ grep Hunk *.log | grep FAILED
0005-hj-leader-gate-v11.patch.log:Hunk #1 FAILED at 14.
0010-hj-parallel-v11.patch.log:Hunk #2 FAILED at 2850.
0010-hj-parallel-v11.patch.log:Hunk #1 FAILED at 21.
0010-hj-parallel-v11.patch.log:Hunk #3 FAILED at 622.
0010-hj-parallel-v11.patch.log:Hunk #6 FAILED at 687.
0010-hj-parallel-v11.patch.log:Hunk #1 FAILED at 21.
0010-hj-parallel-v11.patch.log:Hunk #3 FAILED at 153. What is wrong? The 
sources were clean:

bash-4.2$ git status
# On branch master
nothing to commit, working directory clean

I was patching by the command:
patch -b -i 
../.patches/parallel-shared-hash-v11/0001-hj-refactor-memory-accounting-v11.patch
 -p1 --verbose > 
../.patches/parallel-shared-hash-v11/0001-hj-refactor-memory-accounting-v11.patch.log
patch -b -i 
../.patches/parallel-shared-hash-v11/0002-hj-refactor-batch-increases-v11.patch 
-p1 --verbose > 
../.patches/parallel-shared-hash-v11/0002-hj-refactor-batch-increases-v11.patch.log
patch -b -i 
../.patches/parallel-shared-hash-v11/0003-hj-refactor-unmatched-v11.patch -p1 
--verbose > 
../.patches/parallel-shared-hash-v11/0003-hj-refactor-unmatched-v11.patch.log
patch -b -i ../.patches/parallel-shared-hash-v11/0004-hj-barrier-v11.patch -p1 
--verbose > ../.patches/parallel-shared-hash-v11/0004-hj-barrier-v11.patch.log
patch -b -i ../.patches/parallel-shared-hash-v11/0005-hj-leader-gate-v11.patch 
-p1 --verbose > 
../.patches/parallel-shared-hash-v11/0005-hj-leader-gate-v11.patch.log
patch -b -i 
../.patches/parallel-shared-hash-v11/0006-hj-let-node-have-seg-in-worker-v11.patch
 -p1 --verbose > 
../.patches/parallel-shared-hash-v11/0006-hj-let-node-have-seg-in-worker-v11.patch.log
patch -b -i 
../.patches/parallel-shared-hash-v11/0007-hj-remove-buf-file-is-temp-v11.patch 
-p1 --verbose > 
../.patches/parallel-shared-hash-v11/0007-hj-remove-buf-file-is-temp-v11.patch.log
patch -b -i ../.patches/parallel-shared-hash-v11/0008-hj-buf-file-set-v11.patch 
-p1 --verbose > 
../.patches/parallel-shared-hash-v11/0008-hj-buf-file-set-v11.patch.log
patch -b -i 
../.patches/parallel-shared-hash-v11/0009-hj-shared-tuplestore-v11.patch -p1 
--verbose > 
../.patches/parallel-shared-hash-v11/0009-hj-shared-tuplestore-v11.patch.log
patch -b -i ../.patches/parallel-shared-hash-v11/0010-hj-parallel-v11.patch -p1 
--verbose > ../.patches/parallel-shared-hash-v11/0010-hj-parallel-v11.patch.log 
Best Regards,

Oleg Golovanov
Moscow, Russia

>Вторник,  4 апреля 2017, 0:28 +03:00 от Thomas Munro 
>:
>
>On Tue, Apr 4, 2017 at 9:11 AM, Andres Freund < and...@anarazel.de > wrote:
>> Hi,
>>
>> On 2017-03-31 17:53:12 +1300, Thomas Munro wrote:
>>> Thanks very much to Rafia for testing, and to Andres for his copious
>>> review feedback.  Here's a new version.  Changes:
>>
>> I unfortunately think it's too late to get this into v10.  There's still
>> heavy development going on, several pieces changed quite noticeably
>> since the start of the CF and there's still features missing.  Hence I
>> think this unfortunately has to be pushed - as much as I'd have liked to
>> have this in 10.
>>
>> Do you agree?
>
>Agreed.
>
>Thank you very much Andres, Ashutosh, Peter, Rafia and Robert for all
>the review, testing and discussion so far.
>
>-- 
>Thomas Munro
>http://www.enterprisedb.com
>


Re: [HACKERS] WIP: [[Parallel] Shared] Hash

2017-04-03 Thread Thomas Munro
On Tue, Apr 4, 2017 at 9:11 AM, Andres Freund  wrote:
> Hi,
>
> On 2017-03-31 17:53:12 +1300, Thomas Munro wrote:
>> Thanks very much to Rafia for testing, and to Andres for his copious
>> review feedback.  Here's a new version.  Changes:
>
> I unfortunately think it's too late to get this into v10.  There's still
> heavy development going on, several pieces changed quite noticeably
> since the start of the CF and there's still features missing.  Hence I
> think this unfortunately has to be pushed - as much as I'd have liked to
> have this in 10.
>
> Do you agree?

Agreed.

Thank you very much Andres, Ashutosh, Peter, Rafia and Robert for all
the review, testing and discussion so far.

-- 
Thomas Munro
http://www.enterprisedb.com


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] WIP: [[Parallel] Shared] Hash

2017-04-03 Thread Andres Freund
Hi,

On 2017-03-31 17:53:12 +1300, Thomas Munro wrote:
> Thanks very much to Rafia for testing, and to Andres for his copious
> review feedback.  Here's a new version.  Changes:

I unfortunately think it's too late to get this into v10.  There's still
heavy development going on, several pieces changed quite noticeably
since the start of the CF and there's still features missing.  Hence I
think this unfortunately has to be pushed - as much as I'd have liked to
have this in 10.

Do you agree?

Regards,

Andres


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] WIP: [[Parallel] Shared] Hash

2017-03-31 Thread Andres Freund
Hi Thomas,

On 2017-03-31 17:53:12 +1300, Thomas Munro wrote:
> Thanks very much to Rafia for testing, and to Andres for his copious
> review feedback.  Here's a new version.  Changes:

I've not looked at that aspect, but one thing I think would be good is
to first add patch that increases coverage of nodeHash[join].c to nearly
100%.  There's currently significant bits of nodeHash.c that aren't
covered (skew optimization, large tuples).

https://coverage.postgresql.org/src/backend/executor/nodeHash.c.gcov.html
https://coverage.postgresql.org/src/backend/executor/nodeHashjoin.c.gcov.html

- Andres


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] WIP: [[Parallel] Shared] Hash

2017-03-30 Thread Thomas Munro
Hi hackers,

Thanks very much to Rafia for testing, and to Andres for his copious
review feedback.  Here's a new version.  Changes:

1.  Keep all the backing files that are part of a BufFileSet in
subdirectories, as suggested by Andres.  Now, instead of that
unpopular logic for scanning ranges of possible file paths to delete,
we can just blow away whole directories that group sets of related
files.

2.  Don't expose 'participant' and 'partition' concepts, Andres didn't
like much, in the BufFile API.  There is a new concept 'stripe' which
client code of BufFileSet can use to specify the participant number in
a more general way without saying so: it's really just a way to spread
files over tablespaces.  I'm not sure if tablespaces are really used
that way much, but it seemed like Peter wasn't going to be too happy
with a proposal that didn't do *something* to respect the existing
temp_tablespaces GUC beahviour (and he'd be right).  But I didn't
think it would make any kind of sense at all to stripe by 1GB segments
as private BufFiles do when writing from multiple processes, as I have
argued elsewhere, hence this scheme.

The 'qunique' function used here (basically poor man's std::unique) is
one I proposed earlier, with the name suggested by Tom Lane:

See 
https://www.postgresql.org/message-id/flat/CAEepm%3D2vmFTNpAmwbGGD2WaryM6T3hSDVKQPfUwjdD_5XY6vAA%40mail.gmail.com
.

3.  Merged the single-batch and multi-batch patches into one.
EarlierI had the idea that it was easier to review them in layers
since I hoped people might catch a glimpse of the central simplicity
without being hit by a wall of multi-batch logic, but since Andres is
reviewing and disagrees, I give you 0010-hj-parallel-v11.patch which
weighs in at 32 files changed, 2278 insertions(+), 250 deletions(-).

4.  Moved the DSM handling to the every end of resowner.c's cleanup.
Peter pointed out that it would otherwise happen before fd.c Files are
closed.  He was concerned about a different aspect of that which I'm
not sure I fully understand, but at the very least it seemed to
represent a significant problem for this design on Windows.  I
discussed this briefly with Robert off-list and he told me that there
is probably no good reason for the ordering that we have, and what's
more, there may be good arguments even outside this case for DSM
segments being cleaned up as late as possible, now that they contain
shared control information and not just tuple data as once had been
imagined.  I can't think of any reason why this would not be safe.
Can you?

5.  The empty inner relation optimisation implemented.

Some smaller changes and miles of feedback inline below:

On Mon, Mar 27, 2017 at 11:03 AM, Thomas Munro
 wrote:
> On Mon, Mar 27, 2017 at 9:41 AM, Andres Freund  wrote:
>> SharedBufFile allows temporary files to be created by one backend and
>> then exported for read-only access by other backends, with clean-up
>> managed by reference counting associated with a DSM segment.  This includes
>> changes to fd.c and buffile.c to support new kinds of temporary file.
>>
>>
>> diff --git a/src/backend/storage/file/buffile.c 
>> b/src/backend/storage/file/buffile.c
>> index 4ca0ea4..a509c05 100644
>> --- a/src/backend/storage/file/buffile.c
>> +++ b/src/backend/storage/file/buffile.c
>>
>> I think the new facilities should be explained in the file's header.
>
> Will do.

Done.

>> @@ -68,9 +71,10 @@ struct BufFile
>>  * avoid making redundant FileSeek calls.
>>  */
>>
>> -   boolisTemp; /* can only add files if 
>> this is TRUE */
>> +   boolisSegmented;/* can only add files if this is 
>> TRUE */
>>
>> That's a bit of a weird and uncommented upon change.
>
> I was trying to cut down on the number of places we use the word
> 'temporary' to activate various different behaviours.  In this case,
> the only thing it controls is whether the BufFile is backed by one
> single fd.c File or many segments, so I figured it should be renamed.
>
> As Peter and you have pointed out, there may be a case for removing it
> altogether.

Done in 0007-hj-remove-buf-file-is-temp-v11.patch.

>> @@ -79,6 +83,8 @@ struct BufFile
>>  */
>> ResourceOwner resowner;
>>
>> +   BufFileTag  tag;/* for discoverability 
>> between backends */
>>
>> Not perfectly happy with the name tag here, the name is a bit too
>> similar to BufferTag - something quite different.
>
> Yeah, will rename.

Done.  That existed only because I had sharedbuffile.c which needed
special access to buffile.c via those weird 'tag' interfaces.  In the
new version that isn't required, and a new struct BufFileSet is
provided by buffile.c/h.

>> +static void
>> +make_tagged_path(char *tempdirpath, char *tempfilepath,
>> +const BufFileTag *tag, int segment)
>> +{
>> +   if (tag->tablespace == DEFAULTTABLESPACE_OID ||
>> +   

Re: [HACKERS] WIP: [[Parallel] Shared] Hash

2017-03-30 Thread Rafia Sabih
On Tue, Mar 28, 2017 at 11:11 AM, Rafia Sabih
 wrote:
> On Mon, Mar 27, 2017 at 12:20 PM, Thomas Munro
>  wrote:
>>
>> On Sun, Mar 26, 2017 at 3:56 PM, Thomas Munro
>>  wrote:
>> > But... what you said above must be a problem for Windows.  I believe
>> > it doesn't allow files to be unlinked if they are open, and I see that
>> > DSM segments are cleaned up in resowner's phase ==
>> > RESOURCE_RELEASE_BEFORE_LOCKS and files are closed in phase ==
>> > RESOURCE_RELEASE_AFTER_LOCKS.
>>
>> I thought this last point about Windows might be fatal to my design,
>> but it seems that Windows since at least version 2000 has support for
>> Unixoid unlinkability via the special flag FILE_SHARE_DELETE.
>
> On testing v10 of this patch over commit
> b54aad8e34bd6299093e965c50f4a23da96d7cc3 and applying the tweak
> mentioned in [1], for TPC-H queries I found the results quite
> encouraging,
>
> Experimental setup:
> TPC-H scale factor - 20
> work_mem = 1GB
> shared_buffers = 10GB
> effective_cache_size = 10GB
> random_page_cost = seq_page_cost = 0.1
> max_parallel_workers_per_gather = 4
>
> Performance numbers:
> (Time in seconds)
> Query  |  Head | Patch |
> ---
> Q3   | 73   | 37  |
> Q5   | 56   | 31  |
> Q7   | 40   | 30  |
> Q8   | 8 | 8|
> Q9   | 85   | 42  |
> Q10 | 86   | 46  |
> Q14 | 11   | 6|
> Q16 | 32   | 11  |
> Q21 | 53   | 56  |
>
> Please find the attached file for the explain analyse output of these
> queries on head as well as patch.
> Would be working on analysing the performance of this patch on 300 scale 
> factor.
>
> [1] 
> https://www.postgresql.org/message-id/flat/CAEepm%3D270ze2hVxWkJw-5eKzc3AB4C9KpH3L2kih75R5pdSogg%40mail.gmail.com
> --

Before moving to higher scale I tried playing around work_mem effects
on this patch and came across following results,
All settings are kept as before with the exception of work_mem that is
set to 64MB.

Most of the queries showed similar performance except a few, details
are as follows,
(all time are given in ms)
Query | Head| Patch
-+--+
 Q8 | 8720 | 8839
 Q18   | 370710 | 384347
 Q21   | 53270   | 65189

Clearly, regression in Q8 and Q18 is minor but that in Q21 is
significant. Just to confirm, I have applied the tweak mentioned in
[1] as before,
For the explain analyse output of Q21 on head and with patch, please
check the attached file.

 [1] 
https://www.postgresql.org/message-id/flat/CAEepm%3D270ze2hVxWkJw-5eKzc3AB4C9KpH3L2kih75R5pdSogg%40mail.gmail.com

-- 
Regards,
Rafia Sabih
EnterpriseDB: http://www.enterprisedb.com/


q21_reg_ph.tar.gz
Description: GNU Zip compressed data

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] WIP: [[Parallel] Shared] Hash

2017-03-28 Thread Andres Freund
Hi,

On 2017-03-27 22:33:03 -0700, Andres Freund wrote:
> On 2017-03-23 20:35:09 +1300, Thomas Munro wrote:
> > Here is a new patch series responding to feedback from Peter and Andres:
> 
> Here's a review of 0007 & 0010 together - they're going to have to be
> applied together anyway...
> ...
> ok, ENOTIME for today...

Continuing, where I dropped of tiredly yesterday.


-   ExecHashJoinSaveTuple(tuple,
- hashvalue,
- 
>innerBatchFile[batchno]);
+   if (HashJoinTableIsShared(hashtable))
+   sts_puttuple(hashtable->shared_inner_batches, batchno, 
,
+tuple);
+   else
+   ExecHashJoinSaveTuple(tuple,
+ hashvalue,
+ 
>innerBatchFile[batchno]);
}
 }

Why isn't this done inside of ExecHashJoinSaveTuple?




@@ -1280,6 +1785,68 @@ ExecHashTableReset(HashJoinTable hashtable)

+   /* Rewind the shared read heads for this batch, inner 
and outer. */
+   
sts_prepare_parallel_read(hashtable->shared_inner_batches,
+ 
curbatch);
+   
sts_prepare_parallel_read(hashtable->shared_outer_batches,
+ 
curbatch);

It feels somewhat wrong to do this in here, rather than on the callsites.

+   }
+
+   /*
+* Each participant needs to make sure that data it has written 
for
+* this partition is now read-only and visible to other 
participants.
+*/
+   sts_end_write(hashtable->shared_inner_batches, curbatch);
+   sts_end_write(hashtable->shared_outer_batches, curbatch);
+
+   /*
+* Wait again, so that all workers see the new hash table and 
can
+* safely read from batch files from any participant because 
they have
+* all ended writing.
+*/
+   Assert(BarrierPhase(>shared->barrier) ==
+  PHJ_PHASE_RESETTING_BATCH(curbatch));
+   BarrierWait(>shared->barrier, 
WAIT_EVENT_HASH_RESETTING);
+   Assert(BarrierPhase(>shared->barrier) ==
+  PHJ_PHASE_LOADING_BATCH(curbatch));
+   ExecHashUpdate(hashtable);
+
+   /* Forget the current chunks. */
+   hashtable->current_chunk = NULL;
+   return;
+   }
 
/*
 * Release all the hash buckets and tuples acquired in the prior pass, 
and
@@ -1289,10 +1856,10 @@ ExecHashTableReset(HashJoinTable hashtable)
oldcxt = MemoryContextSwitchTo(hashtable->batchCxt);
 
/* Reallocate and reinitialize the hash bucket headers. */
-   hashtable->buckets = (HashJoinTuple *)
-   palloc0(nbuckets * sizeof(HashJoinTuple));
+   hashtable->buckets = (HashJoinBucketHead *)
+   palloc0(nbuckets * sizeof(HashJoinBucketHead));
 
-   hashtable->spaceUsed = nbuckets * sizeof(HashJoinTuple);
+   hashtable->spaceUsed = nbuckets * sizeof(HashJoinBucketHead);
 
/* Cannot be more than our previous peak; we had this size before. */
Assert(hashtable->spaceUsed <= hashtable->spacePeak);
@@ -1301,6 +1868,22 @@ ExecHashTableReset(HashJoinTable hashtable)
 
/* Forget the chunks (the memory was freed by the context reset above). 
*/
hashtable->chunks = NULL;
+
+   /* Rewind the shared read heads for this batch, inner and outer. */
+   if (hashtable->innerBatchFile[curbatch] != NULL)
+   {
+   if (BufFileSeek(hashtable->innerBatchFile[curbatch], 0, 0L, 
SEEK_SET))
+   ereport(ERROR,
+   (errcode_for_file_access(),
+  errmsg("could not rewind hash-join temporary 
file: %m")));
+   }
+   if (hashtable->outerBatchFile[curbatch] != NULL)
+   {
+   if (BufFileSeek(hashtable->outerBatchFile[curbatch], 0, 0L, 
SEEK_SET))
+   ereport(ERROR,
+   (errcode_for_file_access(),
+  errmsg("could not rewind hash-join temporary 
file: %m")));
+   }
 }
 
 /*
@@ -1310,12 +1893,21 @@ ExecHashTableReset(HashJoinTable hashtable)
 void
 ExecHashTableResetMatchFlags(HashJoinTable hashtable)
 {
+   dsa_pointer chunk_shared = InvalidDsaPointer;
HashMemoryChunk chunk;
HashJoinTuple tuple;
int i;
 
/* Reset all flags in the main table ... */
-   chunk = hashtable->chunks;
+   if 

Re: [HACKERS] WIP: [[Parallel] Shared] Hash

2017-03-28 Thread David Steele

Hi Thomas,

On 3/28/17 1:41 AM, Rafia Sabih wrote:

On Mon, Mar 27, 2017 at 12:20 PM, Thomas Munro


I thought this last point about Windows might be fatal to my design,
but it seems that Windows since at least version 2000 has support for
Unixoid unlinkability via the special flag FILE_SHARE_DELETE.


<...>



Please find the attached file for the explain analyse output of these
queries on head as well as patch.
Would be working on analysing the performance of this patch on 300 scale factor.


I have marked this submission "Waiting for Author".  A new patch is 
needed to address Andres' comments and you should have a look at Rafia's 
results.


Thanks,
--
-David
da...@pgmasters.net


--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] WIP: [[Parallel] Shared] Hash

2017-03-27 Thread Andres Freund
On 2017-03-23 20:35:09 +1300, Thomas Munro wrote:
> Here is a new patch series responding to feedback from Peter and Andres:

Here's a review of 0007 & 0010 together - they're going to have to be
applied together anyway...


diff --git a/doc/src/sgml/config.sgml b/doc/src/sgml/config.sgml
index ac339fb566..775c9126c7 100644
--- a/doc/src/sgml/config.sgml
+++ b/doc/src/sgml/config.sgml
@@ -3814,6 +3814,21 @@ ANY num_sync ( 
+  cpu_shared_tuple_cost (floating 
point)
+  
+   cpu_shared_tuple_cost configuration 
parameter
+  
+  
+  
+   
+Sets the planner's estimate of the cost of sharing rows in
+memory during a parallel query.
+The default is 0.001.
+   
+  
+ 
+

Isn't that really low in comparison to the other costs? I think
specifying a bit more what this actually measures would be good too - is
it putting the tuple in shared memory? Is it accessing it?


+ 
+  cpu_synchronization_cost (floating 
point)
+  
+   cpu_synchronization_cost configuration 
parameter
+  
+  
+  
+   
+Sets the planner's estimate of the cost of waiting at synchronization
+points for other processes while executing parallel queries.
+The default is 1.0.
+   
+  
+ 

Isn't this also really cheap in comparison to a, probably cached, seq
page read?


+   if (HashJoinTableIsShared(hashtable))
+   {
+   /*
+* Synchronize parallel hash table builds.  At this stage we 
know that
+* the shared hash table has been created, but we don't know if 
our
+* peers are still in MultiExecHash and if so how far through.  
We use
+* the phase to synchronize with them.
+*/
+   barrier = >shared->barrier;
+
+   switch (BarrierPhase(barrier))
+   {
+   case PHJ_PHASE_BEGINNING:

Note pgindent will indent this further.  Might be worthwhile to try to
pgindent the file, revert some of the unintended damage.

/*
 * set expression context
 */

I'd still like this to be moved to the start.


@@ -126,17 +202,79 @@ MultiExecHash(HashState *node)
/* Not subject to skew optimization, so insert 
normally */
ExecHashTableInsert(hashtable, slot, hashvalue);
}
-   hashtable->totalTuples += 1;
+   hashtable->partialTuples += 1;
+   if (!HashJoinTableIsShared(hashtable))
+   hashtable->totalTuples += 1;
}
}

FWIW, I'd put HashJoinTableIsShared() into a local var - the compiler
won't be able to do that on its own because external function calls
could invalidate the result.

That brings me to a related topic: Have you measured whether your
changes cause performance differences?


+   finish_loading(hashtable);

I find the sudden switch to a different naming scheme in the same file a
bit jarring.


+   if (HashJoinTableIsShared(hashtable))
+   BarrierDetach(>shared->shrink_barrier);
+
+   if (HashJoinTableIsShared(hashtable))
+   {

Consecutive if blocks with the same condition...


+   bool elected_to_resize;
+
+   /*
+* Wait for all backends to finish building.  If only one 
worker is
+* running the building phase because of a non-partial inner 
plan, the
+* other workers will pile up here waiting.  If multiple worker 
are
+* building, they should finish close to each other in time.
+*/

That comment is outdated, isn't it?

 
/* resize the hash table if needed (NTUP_PER_BUCKET exceeded) */
-   if (hashtable->nbuckets != hashtable->nbuckets_optimal)
-   ExecHashIncreaseNumBuckets(hashtable);
+   ExecHashUpdate(hashtable);
+   ExecHashIncreaseNumBuckets(hashtable);

So this now doesn't actually increase the number of buckets anymore.

+ reinsert:
+   /* If the table was resized, insert tuples into the new buckets. */
+   ExecHashUpdate(hashtable);
+   ExecHashReinsertAll(hashtable);

ReinsertAll just happens to do nothing if we didn't have to
resize... Not entirely obvious, sure reads as if it were unconditional.
Also, it's not actually "All" when batching is in use, no?


+ post_resize:
+   if (HashJoinTableIsShared(hashtable))
+   {
+   Assert(BarrierPhase(barrier) == PHJ_PHASE_RESIZING);
+   BarrierWait(barrier, WAIT_EVENT_HASH_RESIZING);
+   Assert(BarrierPhase(barrier) == PHJ_PHASE_REINSERTING);
+   }
+
+ reinsert:
+   /* If the table was resized, insert tuples into the new buckets. */
+   ExecHashUpdate(hashtable);
+   ExecHashReinsertAll(hashtable);

Hm.  So even non-resizing backends reach this - but they happen to not
do anything 

Re: [HACKERS] WIP: [[Parallel] Shared] Hash

2017-03-27 Thread Thomas Munro
On Sun, Mar 26, 2017 at 3:56 PM, Thomas Munro
 wrote:
> But... what you said above must be a problem for Windows.  I believe
> it doesn't allow files to be unlinked if they are open, and I see that
> DSM segments are cleaned up in resowner's phase ==
> RESOURCE_RELEASE_BEFORE_LOCKS and files are closed in phase ==
> RESOURCE_RELEASE_AFTER_LOCKS.

I thought this last point about Windows might be fatal to my design,
but it seems that Windows since at least version 2000 has support for
Unixoid unlinkability via the special flag FILE_SHARE_DELETE.

-- 
Thomas Munro
http://www.enterprisedb.com


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] WIP: [[Parallel] Shared] Hash

2017-03-26 Thread Peter Geoghegan
On Sun, Mar 26, 2017 at 6:50 PM, Thomas Munro
 wrote:
> Like you, I also tend to suspect that people would be more likely to
> use RAID type technologies to stripe things like this for both
> bandwidth and space reasons these days.  Tablespaces seem to make more
> sense as a way of separating different classes of storage
> (fast/expensive, slow/cheap etc), not as an IO or space striping
> technique.

I agree.

-- 
Peter Geoghegan


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] WIP: [[Parallel] Shared] Hash

2017-03-26 Thread Thomas Munro
On Mon, Mar 27, 2017 at 12:12 PM, Peter Geoghegan  wrote:
> On Sun, Mar 26, 2017 at 3:41 PM, Thomas Munro
>  wrote:
>>> 1.  Segments are what buffile.c already calls the individual
>>> capped-at-1GB files that it manages.  They are an implementation
>>> detail that is not part of buffile.c's user interface.  There seems to
>>> be no reason to change that.
>>
>> After reading your next email I realised this is not quite true:
>> BufFileTell and BufFileSeek expose the existence of segments.
>
> Yeah, that's something that tuplestore.c itself relies on.
>
> I always thought that the main reason practical why we have BufFile
> multiplex 1GB segments concerns use of temp_tablespaces, rather than
> considerations that matter only when using obsolete file systems:
>
> /*
>  * We break BufFiles into gigabyte-sized segments, regardless of RELSEG_SIZE.
>  * The reason is that we'd like large temporary BufFiles to be spread across
>  * multiple tablespaces when available.
>  */
>
> Now, I tend to think that most installations that care about
> performance would be better off using RAID to stripe their one temp
> tablespace file system. But, I suppose this still makes sense when you
> have a number of file systems that happen to be available, and disk
> capacity is the main concern. PHJ uses one temp tablespace per worker,
> which I further suppose might not be as effective in balancing disk
> space usage.

I was thinking about IO bandwidth balance rather than size.  If you
rotate through tablespaces segment-by-segment, won't you be exposed to
phasing effects that could leave disk arrays idle for periods of time?
 Whereas if you assign them to participants, you can only get idle
arrays if you have fewer participants than tablespaces.

This seems like a fairly complex subtopic and I don't have a strong
view on it.  Clearly you could rotate through tablespaces on the basis
of participant, partition, segment, some combination, or something
else.  Doing it by participant seemed to me to be the least prone to
IO imbalance cause by phasing effects (= segment based) or data
distribution (= partition based), of the options I considered when I
wrote it that way.

Like you, I also tend to suspect that people would be more likely to
use RAID type technologies to stripe things like this for both
bandwidth and space reasons these days.  Tablespaces seem to make more
sense as a way of separating different classes of storage
(fast/expensive, slow/cheap etc), not as an IO or space striping
technique.  I may be way off base there though...

-- 
Thomas Munro
http://www.enterprisedb.com


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] WIP: [[Parallel] Shared] Hash

2017-03-26 Thread Peter Geoghegan
On Sun, Mar 26, 2017 at 3:41 PM, Thomas Munro
 wrote:
>> 1.  Segments are what buffile.c already calls the individual
>> capped-at-1GB files that it manages.  They are an implementation
>> detail that is not part of buffile.c's user interface.  There seems to
>> be no reason to change that.
>
> After reading your next email I realised this is not quite true:
> BufFileTell and BufFileSeek expose the existence of segments.

Yeah, that's something that tuplestore.c itself relies on.

I always thought that the main reason practical why we have BufFile
multiplex 1GB segments concerns use of temp_tablespaces, rather than
considerations that matter only when using obsolete file systems:

/*
 * We break BufFiles into gigabyte-sized segments, regardless of RELSEG_SIZE.
 * The reason is that we'd like large temporary BufFiles to be spread across
 * multiple tablespaces when available.
 */

Now, I tend to think that most installations that care about
performance would be better off using RAID to stripe their one temp
tablespace file system. But, I suppose this still makes sense when you
have a number of file systems that happen to be available, and disk
capacity is the main concern. PHJ uses one temp tablespace per worker,
which I further suppose might not be as effective in balancing disk
space usage.

-- 
Peter Geoghegan


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] WIP: [[Parallel] Shared] Hash

2017-03-26 Thread Thomas Munro
On Mon, Mar 27, 2017 at 11:03 AM, Thomas Munro
 >> Is there a risk that this ends up
running afoul of filename length
>> limits on some platforms?
>
> Hmm.  I didn't think so.  Do we have a project guideline on maximum
> path lengths based on some kind of survey?  There are various limits
> involved (filesystem and OS per-path-component limits, total limits,
> and the confusing PATH_MAX, MAX_PATH etc macros), but I was under the
> impression that these numbers were always at least 255.  This scheme
> seems capable of producing ~50 bytes in the final component
> (admittedly more if int is 64 bits), and then nowhere near enough to
> reach a limit of that order in the earlier components.

Err, plus prefix.  Still seems unlikely to be too long.

>> I'm a bit unhappy with the partition terminology around this. It's
>> getting a bit confusing. We have partitions, participants and
>> segements. Most of them could be understood for something entirely
>> different than the meaning you have here...
>
> Ok.  Let me try to explain and defend them and see if we can come up
> with something better.
>
> 1.  Segments are what buffile.c already calls the individual
> capped-at-1GB files that it manages.  They are an implementation
> detail that is not part of buffile.c's user interface.  There seems to
> be no reason to change that.

After reading your next email I realised this is not quite true:
BufFileTell and BufFileSeek expose the existence of segments.

-- 
Thomas Munro
http://www.enterprisedb.com


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] WIP: [[Parallel] Shared] Hash

2017-03-26 Thread Thomas Munro
On Mon, Mar 27, 2017 at 9:41 AM, Andres Freund  wrote:
> Hi,
>
>
> SharedBufFile allows temporary files to be created by one backend and
> then exported for read-only access by other backends, with clean-up
> managed by reference counting associated with a DSM segment.  This includes
> changes to fd.c and buffile.c to support new kinds of temporary file.
>
>
> diff --git a/src/backend/storage/file/buffile.c 
> b/src/backend/storage/file/buffile.c
> index 4ca0ea4..a509c05 100644
> --- a/src/backend/storage/file/buffile.c
> +++ b/src/backend/storage/file/buffile.c
>
> I think the new facilities should be explained in the file's header.

Will do.

> @@ -68,9 +71,10 @@ struct BufFile
>  * avoid making redundant FileSeek calls.
>  */
>
> -   boolisTemp; /* can only add files if this 
> is TRUE */
> +   boolisSegmented;/* can only add files if this is TRUE 
> */
>
> That's a bit of a weird and uncommented upon change.

I was trying to cut down on the number of places we use the word
'temporary' to activate various different behaviours.  In this case,
the only thing it controls is whether the BufFile is backed by one
single fd.c File or many segments, so I figured it should be renamed.

As Peter and you have pointed out, there may be a case for removing it
altogether.

> @@ -79,6 +83,8 @@ struct BufFile
>  */
> ResourceOwner resowner;
>
> +   BufFileTag  tag;/* for discoverability 
> between backends */
>
> Not perfectly happy with the name tag here, the name is a bit too
> similar to BufferTag - something quite different.

Yeah, will rename.

> +static void
> +make_tagged_path(char *tempdirpath, char *tempfilepath,
> +const BufFileTag *tag, int segment)
> +{
> +   if (tag->tablespace == DEFAULTTABLESPACE_OID ||
> +   tag->tablespace == GLOBALTABLESPACE_OID)
> +   snprintf(tempdirpath, MAXPGPATH, "base/%s", 
> PG_TEMP_FILES_DIR);
> +   else
> +   {
> +   snprintf(tempdirpath, MAXPGPATH, "pg_tblspc/%u/%s/%s",
> +tag->tablespace, 
> TABLESPACE_VERSION_DIRECTORY,
> +PG_TEMP_FILES_DIR);
> +   }
> +
> +   snprintf(tempfilepath, MAXPGPATH, "%s/%s%d.%d.%d.%d.%d", tempdirpath,
> +PG_TEMP_FILE_PREFIX,
> +tag->creator_pid, tag->set, tag->partition, 
> tag->participant,
> +segment);
>
> Is there a risk that this ends up running afoul of filename length
> limits on some platforms?

Hmm.  I didn't think so.  Do we have a project guideline on maximum
path lengths based on some kind of survey?  There are various limits
involved (filesystem and OS per-path-component limits, total limits,
and the confusing PATH_MAX, MAX_PATH etc macros), but I was under the
impression that these numbers were always at least 255.  This scheme
seems capable of producing ~50 bytes in the final component
(admittedly more if int is 64 bits), and then nowhere near enough to
reach a limit of that order in the earlier components.

> +}
> +
> +static File
> +make_tagged_segment(const BufFileTag *tag, int segment)
> +{
> +   Filefile;
> +   chartempdirpath[MAXPGPATH];
> +   chartempfilepath[MAXPGPATH];
> +
> +   /*
> +* There is a remote chance that disk files with this (pid, set) pair
> +* already exists after a crash-restart.  Since the presence of
> +* consecutively numbered segment files is used by BufFileOpenShared 
> to
> +* determine the total size of a shared BufFile, we'll defend against
> +* confusion by unlinking segment 1 (if it exists) before creating 
> segment
> +* 0.
> +*/
>
> Gah.  Why on earth aren't we removing temp files when restarting, not
> just on the initial start?  That seems completely wrong?

See the comment above RemovePgTempFiles in fd.c.  From comments on
this list I understand that this is a subject that Robert and Tom
don't agree on.  I don't mind either way, but as long as
RemovePgTempFiles works that way and my patch uses the existence of
files to know how many files there are, I have to defend against that
danger by making sure that I don't accidentally identify files from
before a crash/restart as active.

> If we do decide not to change this: Why is that sufficient? Doesn't the
> same problem exist for segments later than the first?

It does exist and it is handled.  The comment really should say
"unlinking segment N + 1 (if it exists) before creating segment N".
Will update.

> +/*
> + * Open a file that was previously created in another backend with
> + * BufFileCreateShared.
> + */
> +BufFile *
> +BufFileOpenTagged(const BufFileTag *tag)
> +{
> +   BufFile*file = (BufFile *) palloc(sizeof(BufFile));
> +   chartempdirpath[MAXPGPATH];
> + 

Re: [HACKERS] WIP: [[Parallel] Shared] Hash

2017-03-26 Thread Andres Freund
On 2017-03-23 20:35:09 +1300, Thomas Munro wrote:
> Here is a new patch series responding to feedback from Peter and Andres:

+
+/* Per-participant shared state. */
+typedef struct SharedTuplestoreParticipant
+{
+   LWLock lock;

Hm. No padding (ala LWLockMinimallyPadded / LWLockPadded) - but that's
probably ok, for now.




+   bool error; /* Error occurred flag. 
*/
+   bool eof;   /* End of file reached. 
*/
+   int read_fileno;/* BufFile segment file number. 
*/
+   off_t read_offset;  /* Offset within segment file. 
*/

Hm. I wonder if it'd not be better to work with 64bit offsets, and just
separate that out upon segment access.

+/* The main data structure in shared memory. */

"main data structure" isn't particularly meaningful.

+struct SharedTuplestore
+{
+   int reading_partition;
+   int nparticipants;
+   int flags;

Maybe add a comment saying /* flag bits from SHARED_TUPLESTORE_* */?

+   Size meta_data_size;

What's this?

+   SharedTuplestoreParticipant participants[FLEXIBLE_ARRAY_MEMBER];

I'd add a comment here, that there's further data after participants.

+};

+
+/* Per-participant backend-private state. */
+struct SharedTuplestoreAccessor
+{

Hm. The name and it being backend-local are a bit conflicting.

+   int participant;/* My partitipant number. */
+   SharedTuplestore *sts;  /* The shared state. */
+   int nfiles; /* Size of local files 
array. */
+   BufFile **files;/* Files we have open locally 
for writing. */

Shouldn't this mention that it's indexed by partition?

+   BufFile *read_file; /* The current file to read 
from. */
+   int read_partition; /* The current partition to 
read from. */
+   int read_participant;   /* The current participant to read 
from. */
+   int read_fileno;/* BufFile segment file number. 
*/
+   off_t read_offset;  /* Offset within segment file. 
*/
+};


+/*
+ * Initialize a SharedTuplestore in existing shared memory.  There must be
+ * space for sts_size(participants) bytes.  If flags is set to the value
+ * SHARED_TUPLESTORE_SINGLE_PASS then each partition may only be read once,
+ * because underlying files will be deleted.

Any reason not to use flags that are compatible with tuplestore.c?

+ * Tuples that are stored may optionally carry a piece of fixed sized
+ * meta-data which will be retrieved along with the tuple.  This is useful for
+ * the hash codes used for multi-batch hash joins, but could have other
+ * applications.
+ */
+SharedTuplestoreAccessor *
+sts_initialize(SharedTuplestore *sts, int participants,
+  int my_participant_number,
+  Size meta_data_size,
+  int flags,
+  dsm_segment *segment)
+{

Not sure I like that the naming here has little in common with
tuplestore.h's api.


+
+MinimalTuple
+sts_gettuple(SharedTuplestoreAccessor *accessor, void *meta_data)
+{

This needs docs.

+   SharedBufFileSet *fileset = GetSharedBufFileSet(accessor->sts);
+   MinimalTuple tuple = NULL;
+
+   for (;;)
+   {

...
+   /* Check if this participant's file has already been entirely 
read. */
+   if (participant->eof)
+   {
+   BufFileClose(accessor->read_file);
+   accessor->read_file = NULL;
+   LWLockRelease(>lock);
+   continue;

Why are we closing the file while holding the lock?

+
+   /* Read the optional meta-data. */
+   eof = false;
+   if (accessor->sts->meta_data_size > 0)
+   {
+   nread = BufFileRead(accessor->read_file, meta_data,
+   
accessor->sts->meta_data_size);
+   if (nread == 0)
+   eof = true;
+   else if (nread != accessor->sts->meta_data_size)
+   ereport(ERROR,
+   (errcode_for_file_access(),
+errmsg("could not read from 
temporary file: %m")));
+   }
+
+   /* Read the size. */
+   if (!eof)
+   {
+   nread = BufFileRead(accessor->read_file, _size, 
sizeof(tuple_size));
+   if (nread == 0)
+   eof = true;

Why is it legal to have EOF here, if metadata previously didn't have an
EOF? Perhaps add an error if accessor->sts->meta_data_size != 0?


+   if (eof)
+   {
+   

Re: [HACKERS] WIP: [[Parallel] Shared] Hash

2017-03-26 Thread Andres Freund
Hi,


SharedBufFile allows temporary files to be created by one backend and
then exported for read-only access by other backends, with clean-up
managed by reference counting associated with a DSM segment.  This includes
changes to fd.c and buffile.c to support new kinds of temporary file.


diff --git a/src/backend/storage/file/buffile.c 
b/src/backend/storage/file/buffile.c
index 4ca0ea4..a509c05 100644
--- a/src/backend/storage/file/buffile.c
+++ b/src/backend/storage/file/buffile.c

I think the new facilities should be explained in the file's header.


@@ -68,9 +71,10 @@ struct BufFile
 * avoid making redundant FileSeek calls.
 */
 
-   boolisTemp; /* can only add files if this 
is TRUE */
+   boolisSegmented;/* can only add files if this is TRUE */

That's a bit of a weird and uncommented upon change.


@@ -79,6 +83,8 @@ struct BufFile
 */
ResourceOwner resowner;
 
+   BufFileTag  tag;/* for discoverability between 
backends */

Not perfectly happy with the name tag here, the name is a bit too
similar to BufferTag - something quite different.



+static void
+make_tagged_path(char *tempdirpath, char *tempfilepath,
+const BufFileTag *tag, int segment)
+{
+   if (tag->tablespace == DEFAULTTABLESPACE_OID ||
+   tag->tablespace == GLOBALTABLESPACE_OID)
+   snprintf(tempdirpath, MAXPGPATH, "base/%s", PG_TEMP_FILES_DIR);
+   else
+   {
+   snprintf(tempdirpath, MAXPGPATH, "pg_tblspc/%u/%s/%s",
+tag->tablespace, TABLESPACE_VERSION_DIRECTORY,
+PG_TEMP_FILES_DIR);
+   }
+
+   snprintf(tempfilepath, MAXPGPATH, "%s/%s%d.%d.%d.%d.%d", tempdirpath,
+PG_TEMP_FILE_PREFIX,
+tag->creator_pid, tag->set, tag->partition, 
tag->participant,
+segment);

Is there a risk that this ends up running afoul of filename length
limits on some platforms?

+}
+
+static File
+make_tagged_segment(const BufFileTag *tag, int segment)
+{
+   Filefile;
+   chartempdirpath[MAXPGPATH];
+   chartempfilepath[MAXPGPATH];
+
+   /*
+* There is a remote chance that disk files with this (pid, set) pair
+* already exists after a crash-restart.  Since the presence of
+* consecutively numbered segment files is used by BufFileOpenShared to
+* determine the total size of a shared BufFile, we'll defend against
+* confusion by unlinking segment 1 (if it exists) before creating 
segment
+* 0.
+*/

Gah.  Why on earth aren't we removing temp files when restarting, not
just on the initial start?  That seems completely wrong?


If we do decide not to change this: Why is that sufficient? Doesn't the
same problem exist for segments later than the first?

+/*
+ * Open a file that was previously created in another backend with
+ * BufFileCreateShared.
+ */
+BufFile *
+BufFileOpenTagged(const BufFileTag *tag)
+{
+   BufFile*file = (BufFile *) palloc(sizeof(BufFile));
+   chartempdirpath[MAXPGPATH];
+   chartempfilepath[MAXPGPATH];
+   Sizecapacity = 1024;
+   File   *files = palloc(sizeof(File) * capacity);
+   int nfiles = 0;
+
+   /*
+* We don't know how many segments there are, so we'll probe the
+* filesystem to find out.
+*/
+   for (;;)
+   {
+   /* See if we need to expand our file space. */
+   if (nfiles + 1 > capacity)
+   {
+   capacity *= 2;
+   files = repalloc(files, sizeof(File) * capacity);
+   }
+   /* Try to load a segment. */
+   make_tagged_path(tempdirpath, tempfilepath, tag, nfiles);
+   files[nfiles] = PathNameOpenTemporaryFile(tempfilepath);
+   if (files[nfiles] <= 0)
+   break;

Isn't 0 a theoretically valid return value from
PathNameOpenTemporaryFile?

+/*
+ * Delete a BufFile that was created by BufFileCreateTagged.  Return true if
+ * at least one segment was deleted; false indicates that no segment was
+ * found, or an error occurred while trying to delete.  Errors are logged but
+ * the function returns normally because this is assumed to run in a clean-up
+ * path that might already involve an error.
+ */
+bool
+BufFileDeleteTagged(const BufFileTag *tag)
+{
+   chartempdirpath[MAXPGPATH];
+   chartempfilepath[MAXPGPATH];
+   int segment = 0;
+   boolfound = false;
+
+   /*
+* We don't know if the BufFile really exists, because SharedBufFile
+* tracks only the range of file numbers.  If it does exists, we don't
+* know many 

Re: [HACKERS] WIP: [[Parallel] Shared] Hash

2017-03-25 Thread Peter Geoghegan
On Sat, Mar 25, 2017 at 7:56 PM, Thomas Munro
 wrote:
> On Sun, Mar 26, 2017 at 1:53 PM, Peter Geoghegan  wrote:
>> ISTM that your patch now shares a quality with parallel tuplesort: You
>> may now hold files open after an unlink() of the original link/path
>> that they were opened using. As Robert pointed out when discussing
>> parallel tuplesort earlier in the week, that comes with the risk,
>> however small, that the vFD cache will close() the file out from under
>> us during LRU maintenance, resulting in a subsequent open() (at the
>> tail-end of the vFD's lifetime) that fails unexpectedly. It's probably
>> fine to assume that we can sanely close() the file ourselves in fd.c
>> error paths despite a concurrent unlink(), since we never operate on
>> the link itself, and there probably isn't much pressure on each
>> backend's vFD cache. But, is that good enough? I can't say, though I
>> suspect that this particular risk is one that's best avoided.
>>
>> I haven't tested out how much of a problem this might be for your
>> patch, but I do know that resowner.c will call your shared mem segment
>> callback before closing any backend local vFDs, so I can't imagine how
>> it could be that this risk doesn't exist.
>
> I wouldn't have expected anything like that to be a problem, because
> FileClose() doesn't call FileAccess().  So IIUC it wouldn't ever try
> to reopen a kernel fd just to close it.

The concern is that something somewhere does. For example, mdread()
calls FileRead(), which calls FileAccess(), ultimately because of some
obscure catalog access. It's very hard to reason about things like
that.

-- 
Peter Geoghegan


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] WIP: [[Parallel] Shared] Hash

2017-03-25 Thread Thomas Munro
On Sun, Mar 26, 2017 at 1:53 PM, Peter Geoghegan  wrote:
> ISTM that your patch now shares a quality with parallel tuplesort: You
> may now hold files open after an unlink() of the original link/path
> that they were opened using. As Robert pointed out when discussing
> parallel tuplesort earlier in the week, that comes with the risk,
> however small, that the vFD cache will close() the file out from under
> us during LRU maintenance, resulting in a subsequent open() (at the
> tail-end of the vFD's lifetime) that fails unexpectedly. It's probably
> fine to assume that we can sanely close() the file ourselves in fd.c
> error paths despite a concurrent unlink(), since we never operate on
> the link itself, and there probably isn't much pressure on each
> backend's vFD cache. But, is that good enough? I can't say, though I
> suspect that this particular risk is one that's best avoided.
>
> I haven't tested out how much of a problem this might be for your
> patch, but I do know that resowner.c will call your shared mem segment
> callback before closing any backend local vFDs, so I can't imagine how
> it could be that this risk doesn't exist.

I wouldn't have expected anything like that to be a problem, because
FileClose() doesn't call FileAccess().  So IIUC it wouldn't ever try
to reopen a kernel fd just to close it.

But... what you said above must be a problem for Windows.  I believe
it doesn't allow files to be unlinked if they are open, and I see that
DSM segments are cleaned up in resowner's phase ==
RESOURCE_RELEASE_BEFORE_LOCKS and files are closed in phase ==
RESOURCE_RELEASE_AFTER_LOCKS.

Hmm.

-- 
Thomas Munro
http://www.enterprisedb.com


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] WIP: [[Parallel] Shared] Hash

2017-03-25 Thread Peter Geoghegan
On Thu, Mar 23, 2017 at 12:35 AM, Thomas Munro
 wrote:
> Thanks.  I really appreciate your patience with the resource
> management stuff I had failed to think through.

It's a surprisingly difficult problem, that almost requires
prototyping just to explain. No need to apologize. This is the process
by which many hard problems end up being solved.

>> However, I notice that the place that this happens instead,
>> PathNameDelete(), does not repeat the fd.c step of doing a final
>> stat(), and using the stats for a pgstat_report_tempfile(). So, a new
>> pgstat_report_tempfile() call is simply missing. However, the more
>> fundamental issue in my mind is: How can you fix that? Where would it
>> go if you had it?
>
> You're right.  I may be missing something here (again), but it does
> seem straightforward to implement because we always delete each file
> that really exists exactly once (and sometimes we also try to delete
> files that don't exist due to imprecise meta-data, but that isn't
> harmful and we know when that turns out to be the case).

ISTM that your patch now shares a quality with parallel tuplesort: You
may now hold files open after an unlink() of the original link/path
that they were opened using. As Robert pointed out when discussing
parallel tuplesort earlier in the week, that comes with the risk,
however small, that the vFD cache will close() the file out from under
us during LRU maintenance, resulting in a subsequent open() (at the
tail-end of the vFD's lifetime) that fails unexpectedly. It's probably
fine to assume that we can sanely close() the file ourselves in fd.c
error paths despite a concurrent unlink(), since we never operate on
the link itself, and there probably isn't much pressure on each
backend's vFD cache. But, is that good enough? I can't say, though I
suspect that this particular risk is one that's best avoided.

I haven't tested out how much of a problem this might be for your
patch, but I do know that resowner.c will call your shared mem segment
callback before closing any backend local vFDs, so I can't imagine how
it could be that this risk doesn't exist.

FWIW, I briefly entertained the idea that we could pin a vFD for just
a moment, ensuring that the real FD could not be close()'d out by
vfdcache LRU maintenance, which would fix this problem for parallel
tuplesort, I suppose. That may not be workable for PHJ, because PHJ
would probably need to hold on to such a "pin" for much longer, owing
to the lack of any explicit "handover" phase.

-- 
Peter Geoghegan


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] WIP: [[Parallel] Shared] Hash

2017-03-23 Thread Thomas Munro
Hi,

Here is a new patch series responding to feedback from Peter and Andres:

1.  Support pgstat_report_tempfile and log_temp_files, which I had
overlooked as Peter pointed out.

2.  Use a patch format that is acceptable to git am, per complaint
off-list from Andres.  (Not actually made with git format-patch; I
need to learn some more git-fu, but they now apply cleanly with git
am).

On Thu, Mar 23, 2017 at 12:55 PM, Peter Geoghegan  wrote:
> I took a quick look at your V9 today. This is by no means a
> comprehensive review of 0008-hj-shared-buf-file-v9.patch, but it's
> what I can manage right now.

Thanks.  I really appreciate your patience with the resource
management stuff I had failed to think through.

> ...
>
> However, I notice that the place that this happens instead,
> PathNameDelete(), does not repeat the fd.c step of doing a final
> stat(), and using the stats for a pgstat_report_tempfile(). So, a new
> pgstat_report_tempfile() call is simply missing. However, the more
> fundamental issue in my mind is: How can you fix that? Where would it
> go if you had it?

You're right.  I may be missing something here (again), but it does
seem straightforward to implement because we always delete each file
that really exists exactly once (and sometimes we also try to delete
files that don't exist due to imprecise meta-data, but that isn't
harmful and we know when that turns out to be the case).

> If you do the obvious thing of just placing that before the new
> unlink() within PathNameDelete(), on the theory that that needs parity
> with the fd.c stuff, that has non-obvious implications. Does the
> pgstat_report_tempfile() call need to happen when going through this
> path, for example?:
>
>> +/*
>> + * Destroy a shared BufFile early.  Files are normally cleaned up
>> + * automatically when all participants detach, but it might be useful to
>> + * reclaim disk space sooner than that.  The caller asserts that no backends
>> + * will attempt to read from this file again and that only one backend will
>> + * destroy it.
>> + */
>> +void
>> +SharedBufFileDestroy(SharedBufFileSet *set, int partition, int participant)
>> +{

Yes, I think it should definitely go into
PathNameDeleteTemporaryFile() (formerly PathNameDelete()).

> The theory with the unlink()'ing() function PathNameDelete(), I
> gather, is that it doesn't matter if it happens to be called more than
> once, say from a worker and then in an error handling path in the
> leader or whatever. Do I have that right?

Yes, it may be called for a file that doesn't exist either because it
never existed, or because it has already been deleted.  To recap,
there are two reasons it needs to tolerate attempts to delete files
that aren't there:

1.  To be able to delete the fd.c files backing a BufFile given only a
BufFileTag.  We don't know how many segment files there are, but we
know how to build the prefix of the filename so we try to delete
[prefix].0, [prefix].1, [prefix].2 ... until we get ENOENT and
terminate.  I think this sort of thing would be more questionable for
durable storage backing a database object, but for temporary files I
can't think of a problem with it.

2.  SharedBufFileSet doesn't actually know how many partitions exist,
it just knows the *range* of partition numbers (because of its
conflicting fixed space and increasable partitions requirements).
>From that information it can loop building BufFileTags for all backing
files that *might* exist, and in practice they usually do because we
don't tend to have a 'sparse' range of partitions.

The error handling path isn't a special case: whoever is the last to
detach from the DSM segment will delete all the files, whether that
results from an error or not.  Now someone might call
SharedBufFileDestroy() to delete files sooner, but that can't happen
at the same time as a detach cleanup (the caller is still attached).

As a small optimisation avoiding a bunch of pointless unlink syscalls,
I shrink the SharedBufFileSet range if you happen to delete explicitly
with a partition number at the extremities of the range, and it so
happens that Parallel Hash Join explicitly deletes them in partition
order as the join runs, so in practice the range is empty by the time
SharedBufFileSet's cleanup runs and there is nothing to do, unless an
error occurs.

> Obviously the concern I have about that is that any stat() call you
> might add for the benefit of a new pgstat_report_tempfile() call,
> needed to keep parity with fd.c, now has a risk of double counting in
> error paths, if I'm not mistaken. We do need to do that accounting in
> the event of error, just as we do when there is no error, at least if
> current stats collector behavior is to be preserved. How can you
> determine which duplicate call here is the duplicate? In other words,
> how can you figure out which one is not supposed to
> pgstat_report_tempfile()? If the size of temp files in each worker is
> unknowable to the 

Re: [HACKERS] WIP: [[Parallel] Shared] Hash

2017-03-22 Thread Peter Geoghegan
On Wed, Mar 22, 2017 at 3:17 AM, Thomas Munro
 wrote:
>> If I follow the new code correctly, then it doesn't matter that you've
>> unlink()'d to take care of the more obvious resource management chore.
>> You can still have a reference leak like this, if I'm not mistaken,
>> because you still have backend local state (local VfdCache) that is
>> left totally decoupled with the new "shadow resource manager" for
>> shared BufFiles.
>
> You're right.  The attached version fixes these problems.  The
> BufFiles created or opened in this new way now participate in both of
> our leak-detection and clean-up schemes:  the one in resowner.c
> (because I'm now explicitly registering with it as I had failed to do
> before) and the one in CleanupTempFiles (because FD_CLOSE_AT_EOXACT is
> set, which I already had in the previous version for the creator, but
> not the opener of such a file).  I tested by commenting out my
> explicit BufFileClose calls to check that resowner.c starts
> complaining, and then by commenting out the resowner registration too
> to check that CleanupTempFiles starts complaining.

I took a quick look at your V9 today. This is by no means a
comprehensive review of 0008-hj-shared-buf-file-v9.patch, but it's
what I can manage right now.

The main change you made is well represented by the following part of
the patch, where you decouple close at eoXact with delete at eoXact,
with the intention of doing one but not the other for BufFiles that
are shared:

>  /* these are the assigned bits in fdstate below: */
> -#define FD_TEMPORARY   (1 << 0)/* T = delete when closed */
> -#define FD_XACT_TEMPORARY  (1 << 1)/* T = delete at eoXact */
> +#define FD_DELETE_AT_CLOSE (1 << 0)/* T = delete when closed */
> +#define FD_CLOSE_AT_EOXACT (1 << 1)/* T = close at eoXact */
> +#define FD_TEMP_FILE_LIMIT (1 << 2)/* T = respect temp_file_limit */

So, shared BufFile fd.c segments within backend local resource manager
do not have FD_DELETE_AT_CLOSE set, because you mean to do that part
yourself by means of generic shared cleanup through dynamic shared
memory segment callback. So far so good.

However, I notice that the place that this happens instead,
PathNameDelete(), does not repeat the fd.c step of doing a final
stat(), and using the stats for a pgstat_report_tempfile(). So, a new
pgstat_report_tempfile() call is simply missing. However, the more
fundamental issue in my mind is: How can you fix that? Where would it
go if you had it?

If you do the obvious thing of just placing that before the new
unlink() within PathNameDelete(), on the theory that that needs parity
with the fd.c stuff, that has non-obvious implications. Does the
pgstat_report_tempfile() call need to happen when going through this
path, for example?:

> +/*
> + * Destroy a shared BufFile early.  Files are normally cleaned up
> + * automatically when all participants detach, but it might be useful to
> + * reclaim disk space sooner than that.  The caller asserts that no backends
> + * will attempt to read from this file again and that only one backend will
> + * destroy it.
> + */
> +void
> +SharedBufFileDestroy(SharedBufFileSet *set, int partition, int participant)
> +{

The theory with the unlink()'ing() function PathNameDelete(), I
gather, is that it doesn't matter if it happens to be called more than
once, say from a worker and then in an error handling path in the
leader or whatever. Do I have that right?

Obviously the concern I have about that is that any stat() call you
might add for the benefit of a new pgstat_report_tempfile() call,
needed to keep parity with fd.c, now has a risk of double counting in
error paths, if I'm not mistaken. We do need to do that accounting in
the event of error, just as we do when there is no error, at least if
current stats collector behavior is to be preserved. How can you
determine which duplicate call here is the duplicate? In other words,
how can you figure out which one is not supposed to
pgstat_report_tempfile()? If the size of temp files in each worker is
unknowable to the implementation in error paths, does it not follow
that it's unknowable to the user that queries pg_stat_database?

Now, I don't imagine that this should stump you. Maybe I'm wrong about
that possibility (that you cannot have exactly once
unlink()/stat()/whatever), or maybe I'm right and you can fix it while
preserving existing behavior, for example by relying on unlink()
reliably failing when called a second time, no matter how tight any
race was. What exact semantics does unlink() have with concurrency, as
far as the link itself goes?

If I'm not wrong about the general possibility, then maybe the
existing behavior doesn't need to be preserved in error paths, which
are after all exceptional -- it's not as if the statistics collector
is currently highly reliable. It's not obvious that you are
deliberately accepting of any of these risks or costs, though, which I
think needs to 

Re: [HACKERS] WIP: [[Parallel] Shared] Hash

2017-03-22 Thread Thomas Munro
Hi,

Here is a new version addressing feedback from Peter and Andres.
Please see below.

On Wed, Mar 22, 2017 at 3:18 PM, Peter Geoghegan  wrote:
> On Tue, Mar 21, 2017 at 5:07 AM, Thomas Munro
>  wrote:
>>> buffile.c should stop pretending to care about anything other than
>>> temp files, IMV. 100% of all clients that want temporary files go
>>> through buffile.c. 100% of all clients that want non-temp files (files
>>> which are not marked FD_TEMPORARY) access fd.c directly, rather than
>>> going through buffile.c.
>>
>> I still need BufFile because I want buffering.
>>
>> There are 3 separate characteristics enabled by flags with 'temporary'
>> in their name.  I think we should consider separating the concerns by
>> splitting and renaming them:
>>
>> 1.  Segmented BufFile behaviour.  I propose renaming BufFile's isTemp
>> member to isSegmented, because that is what it really does.   I want
>> that feature independently without getting confused about lifetimes.
>> Tested with small MAX_PHYSICAL_FILESIZE as you suggested.
>
> I would have proposed to get rid of the isTemp field entirely. It is
> always true with current usage, any only #ifdef NOT_USED code presumes
> that it could be any other way. BufFile is all about temp files, which
> ISTM should be formalized. The whole point of BufFile is to segment
> fd.c temp file segments. Who would ever want to use BufFile without
> that capability anyway?

Yeah, it looks like you're probably right, but I guess others could
have uses for BufFile that we don't know about.  It doesn't seem like
it hurts to leave the variable in existence.

>> 2.  The temp_file_limit system.  Currently this applies to fd.c files
>> opened with FD_TEMPORARY.  You're right that we shouldn't be able to
>> escape that sanity check on disk space just because we want to manage
>> disk file ownership differently.  I propose that we create a new flag
>> FD_TEMP_FILE_LIMIT that can be set independentlyisTemp of the flags
>> controlling disk file lifetime.  When working with SharedBufFileSet,
>> the limit applies to each backend in respect of files it created,
>> while it has them open.  This seems a lot simpler than any
>> shared-temp-file-limit type scheme and is vaguely similar to the way
>> work_mem applies in each backend for parallel query.
>
> I agree that that makes sense as a user-visible behavior of
> temp_file_limit. This user-visible behavior is what I actually
> implemented for parallel CREATE INDEX.

Ok, good.

>> 3.  Delete-on-close/delete-at-end-of-xact.  I don't want to use that
>> facility so I propose disconnecting it from the above.  We c{ould
>> rename those fd.c-internal flags FD_TEMPORARY and FD_XACT_TEMPORARY to
>> FD_DELETE_AT_CLOSE and FD_DELETE_AT_EOXACT.
>
> This reliably unlink()s all files, albeit while relying on unlink()
> ENOENT as a condition that terminates deletion of one particular
> worker's BufFile's segments. However, because you effectively no
> longer use resowner.c, ISTM that there is still a resource leak in
> error paths. ResourceOwnerReleaseInternal() won't call FileClose() for
> temp-ish files (that are not quite temp files in the current sense) in
> the absence of no other place managing to do so, such as
> BufFileClose(). How can you be sure that you'll actually close() the
> FD itself (not vFD) within fd.c in the event of an error? Or Delete(),
> which does some LRU maintenance for backend's local VfdCache?

Yeah, I definitely need to use resowner.c.  The only thing I want to
opt out of is automatic file deletion in that code path.

> If I follow the new code correctly, then it doesn't matter that you've
> unlink()'d to take care of the more obvious resource management chore.
> You can still have a reference leak like this, if I'm not mistaken,
> because you still have backend local state (local VfdCache) that is
> left totally decoupled with the new "shadow resource manager" for
> shared BufFiles.

You're right.  The attached version fixes these problems.  The
BufFiles created or opened in this new way now participate in both of
our leak-detection and clean-up schemes:  the one in resowner.c
(because I'm now explicitly registering with it as I had failed to do
before) and the one in CleanupTempFiles (because FD_CLOSE_AT_EOXACT is
set, which I already had in the previous version for the creator, but
not the opener of such a file).  I tested by commenting out my
explicit BufFileClose calls to check that resowner.c starts
complaining, and then by commenting out the resowner registration too
to check that CleanupTempFiles starts complaining.

>> As shown in 0008-hj-shared-buf-file-v8.patch.  Thoughts?
>
> A less serious issue I've also noticed is that you add palloc() calls,
> implicitly using the current memory context, within buffile.c.
> BufFileOpenTagged() has some, for example. However, there is a note
> that we don't need to save the memory context when we open a BufFile
> because we always 

Re: [HACKERS] WIP: [[Parallel] Shared] Hash

2017-03-21 Thread Peter Geoghegan
On Tue, Mar 21, 2017 at 7:18 PM, Peter Geoghegan  wrote:
>> As shown in 0008-hj-shared-buf-file-v8.patch.  Thoughts?
>
> A less serious issue I've also noticed is that you add palloc() calls,
> implicitly using the current memory context, within buffile.c.
> BufFileOpenTagged() has some, for example. However, there is a note
> that we don't need to save the memory context when we open a BufFile
> because we always repalloc(). That is no longer the case here.

Similarly, I think that your new type of BufFile has no need to save
CurrentResourceOwner, because it won't ever actually be used. I
suppose that you should at least note this in comments.


-- 
Peter Geoghegan


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] WIP: [[Parallel] Shared] Hash

2017-03-21 Thread Peter Geoghegan
On Tue, Mar 21, 2017 at 5:07 AM, Thomas Munro
 wrote:
>> buffile.c should stop pretending to care about anything other than
>> temp files, IMV. 100% of all clients that want temporary files go
>> through buffile.c. 100% of all clients that want non-temp files (files
>> which are not marked FD_TEMPORARY) access fd.c directly, rather than
>> going through buffile.c.
>
> I still need BufFile because I want buffering.
>
> There are 3 separate characteristics enabled by flags with 'temporary'
> in their name.  I think we should consider separating the concerns by
> splitting and renaming them:
>
> 1.  Segmented BufFile behaviour.  I propose renaming BufFile's isTemp
> member to isSegmented, because that is what it really does.   I want
> that feature independently without getting confused about lifetimes.
> Tested with small MAX_PHYSICAL_FILESIZE as you suggested.

I would have proposed to get rid of the isTemp field entirely. It is
always true with current usage, any only #ifdef NOT_USED code presumes
that it could be any other way. BufFile is all about temp files, which
ISTM should be formalized. The whole point of BufFile is to segment
fd.c temp file segments. Who would ever want to use BufFile without
that capability anyway?

> 2.  The temp_file_limit system.  Currently this applies to fd.c files
> opened with FD_TEMPORARY.  You're right that we shouldn't be able to
> escape that sanity check on disk space just because we want to manage
> disk file ownership differently.  I propose that we create a new flag
> FD_TEMP_FILE_LIMIT that can be set independentlyisTemp of the flags
> controlling disk file lifetime.  When working with SharedBufFileSet,
> the limit applies to each backend in respect of files it created,
> while it has them open.  This seems a lot simpler than any
> shared-temp-file-limit type scheme and is vaguely similar to the way
> work_mem applies in each backend for parallel query.

I agree that that makes sense as a user-visible behavior of
temp_file_limit. This user-visible behavior is what I actually
implemented for parallel CREATE INDEX.

> 3.  Delete-on-close/delete-at-end-of-xact.  I don't want to use that
> facility so I propose disconnecting it from the above.  We c{ould
> rename those fd.c-internal flags FD_TEMPORARY and FD_XACT_TEMPORARY to
> FD_DELETE_AT_CLOSE and FD_DELETE_AT_EOXACT.

This reliably unlink()s all files, albeit while relying on unlink()
ENOENT as a condition that terminates deletion of one particular
worker's BufFile's segments. However, because you effectively no
longer use resowner.c, ISTM that there is still a resource leak in
error paths. ResourceOwnerReleaseInternal() won't call FileClose() for
temp-ish files (that are not quite temp files in the current sense) in
the absence of no other place managing to do so, such as
BufFileClose(). How can you be sure that you'll actually close() the
FD itself (not vFD) within fd.c in the event of an error? Or Delete(),
which does some LRU maintenance for backend's local VfdCache?

If I follow the new code correctly, then it doesn't matter that you've
unlink()'d to take care of the more obvious resource management chore.
You can still have a reference leak like this, if I'm not mistaken,
because you still have backend local state (local VfdCache) that is
left totally decoupled with the new "shadow resource manager" for
shared BufFiles.

> As shown in 0008-hj-shared-buf-file-v8.patch.  Thoughts?

A less serious issue I've also noticed is that you add palloc() calls,
implicitly using the current memory context, within buffile.c.
BufFileOpenTagged() has some, for example. However, there is a note
that we don't need to save the memory context when we open a BufFile
because we always repalloc(). That is no longer the case here.

-- 
Peter Geoghegan


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] WIP: [[Parallel] Shared] Hash

2017-03-21 Thread Thomas Munro
Hi,

Here is a new version of the patch series addressing complaints from
Rafia, Peter, Andres and Robert -- see below.

First, two changes not already covered in this thread:

1.  Today Robert asked me a question off-list that I hadn't previously
considered: since I am sharing tuples between backends, don't I have
the same kind of transient record remapping problems that tqueue.c has
to deal with?  The answer must be yes, and in fact it's a trickier
version because there are N 'senders' and N 'receivers' communicating
via the shared hash table.  So I decided to avoid the problem by not
planning shared hash tables if the tuples could include transient
RECORD types: see tlist_references_transient_type() in
0007-hj-shared-single-batch-v8.patch.  Perhaps in the future we can
find a way for parallel query to keep local types in sync, so this
restriction could be lifted.  (I've tested this with a specially
modified build, because I couldn't figure out how to actually get any
transient types to be considered in a parallel query, but if someone
has a suggestion for a good query for that I'd love to put one into
the regression test.)

2.  Earlier versions included support for Shared Hash (= one worker
builds, other workers wait, but we get to use work_mem * P memory) and
Parallel Shared Hash (= all workers build).  Shared Hash is by now
quite hard to reach, since so many hash join inner plans are now
parallelisable.  I decided to remove support for it from the latest
patch series: I think it adds cognitive load and patch lines for
little or no gain.  With time running out, I thought that it would be
better to rip it out for now to try to simplify things and avoid some
difficult questions about how to cost that mode.  It could be added
with a separate patch after some more study if it really does make
some sense.

>> On Mon, Mar 13, 2017 at 8:40 PM, Rafia Sabih
>>  wrote:
>>> In an attempt to test v7 of this patch on TPC-H 20 scale factor I found a
>>> few regressions,
>>> Q21: 52 secs on HEAD and  400 secs with this patch

As already mentioned there is a planner bug which we can fix
separately from this patch series.  Until that is resolved, please see
that other thread[1] for the extra tweak require for sane results when
testing Q21.

Even with that tweak, there was a slight regression with fewer than 3
workers at 1GB for Q21.  That turned out to be because the patched
version was not always using as many workers as unpatched.  To fix
that, I had to rethink the deadlock avoidance system to make it a bit
less conservative about giving up workers: see
src/backend/utils/misc/leader_gate.c in
0007-hj-shared-single-batch-v8.patch.

Here are some speed-up numbers comparing master to patched that I
recorded on TPCH scale 10 with work_mem = 1GB.  These are the queries
whose plans change with the patch.  Both master and v8 were patched
with fix-neqseljoin-for-semi-joins.patch.

 query | w = 0 | w = 1 | w = 2 | w = 3 | w = 4 | w = 5 | w = 6 | w = 7 | w = 8
---+---+---+---+---+---+---+---+---+---
 Q3| 0.94x | 1.06x | 1.25x | 1.46x | 1.64x | 1.87x | 1.99x | 1.67x | 1.67x
 Q5| 1.17x | 1.03x | 1.23x | 1.27x | 1.44x | 0.56x | 0.95x | 0.94x | 1.16x
 Q7| 1.13x | 1.04x | 1.31x | 1.06x | 1.15x | 1.28x | 1.31x | 1.35x | 1.13x
 Q8| 0.99x | 1.13x | 1.23x | 1.22x | 1.36x | 0.42x | 0.82x | 0.78x | 0.81x
 Q9| 1.16x | 0.95x | 1.92x | 1.68x | 1.90x | 1.89x | 2.02x | 2.05x | 1.81x
 Q10   | 1.01x | 1.03x | 1.08x | 1.10x | 1.16x | 1.17x | 1.09x | 1.01x | 1.07x
 Q12   | 1.03x | 1.19x | 1.42x | 0.75x | 0.74x | 1.00x | 0.99x | 1.00x | 1.01x
 Q13   | 1.10x | 1.66x | 1.99x | 1.00x | 1.12x | 1.00x | 1.12x | 1.01x | 1.13x
 Q14   | 0.97x | 1.13x | 1.22x | 1.45x | 1.43x | 1.55x | 1.55x | 1.50x | 1.45x
 Q16   | 1.02x | 1.13x | 1.07x | 1.09x | 1.10x | 1.10x | 1.13x | 1.10x | 1.11x
 Q18   | 1.05x | 1.43x | 1.33x | 1.21x | 1.07x | 1.57x | 1.76x | 1.09x | 1.09x
 Q21   | 0.99x | 1.01x | 1.07x | 1.18x | 1.28x | 1.37x | 1.63x | 1.26x | 1.60x

These tests are a bit short and noisy and clearly there are some
strange dips in there that need some investigation but the trend is
positive.

Here are some numbers from some simple test joins, so that you can see
the raw speedup of large hash joins without all the other things going
on in those TPCH plans.  I executed 1-join, 2-join and 3-join queries
like this:

  CREATE TABLE simple AS
  SELECT generate_series(1, 1000) AS id,
'aa';
  ANALYZE simple;

  SELECT COUNT(*)
FROM simple r
JOIN simple s USING (id);

  SELECT COUNT(*)
FROM simple r
JOIN simple s USING (id)
JOIN simple t USING (id);

  SELECT COUNT(*)
FROM simple r
JOIN simple s USING (id)
JOIN simple t USING (id)
JOIN simple u USING (id);

Unpatched master can make probing go faster by adding workers, but not
building, so in these self-joins the ability to scale with more CPUs
is limited (here w = 1 

Re: [HACKERS] WIP: [[Parallel] Shared] Hash

2017-03-17 Thread Thomas Munro
On Tue, Mar 14, 2017 at 8:03 AM, Thomas Munro
 wrote:
> On Mon, Mar 13, 2017 at 8:40 PM, Rafia Sabih
>  wrote:
>> In an attempt to test v7 of this patch on TPC-H 20 scale factor I found a
>> few regressions,
>> Q21: 52 secs on HEAD and  400 secs with this patch
>
> Thanks Rafia.  Robert just pointed out off-list that there is a bogus
> 0 row estimate in here:
>
> ->  Parallel Hash Semi Join  (cost=1006599.34..1719227.30 rows=0
> width=24) (actual time=38716.488..100933.250 rows=7315896 loops=5)
>
> Will investigate, thanks.

There are two problems here.

1.  There is a pre-existing cardinality estimate problem for
semi-joins with <> filters.  The big Q21 regression reported by Rafia
is caused by that phenomenon, probably exacerbated by another bug that
allowed 0 cardinality estimates to percolate inside the planner.
Estimates have been clamped at or above 1.0 since her report by commit
1ea60ad6.

I started a new thread to discuss that because it's unrelated to this
patch, except insofar as it confuses the planner about Q21 (with or
without parallelism).  Using one possible selectivity tweak suggested
by Tom Lane, I was able to measure significant speedups on otherwise
unpatched master:

https://www.postgresql.org/message-id/CAEepm%3D11BiYUkgXZNzMtYhXh4S3a9DwUP8O%2BF2_ZPeGzzJFPbw%40mail.gmail.com

2.  If you compare master tweaked as above against the latest version
of my patch series with the tweak, then the patched version always
runs faster with 4 or more workers, but with only 1 or 2 workers Q21
is a bit slower... but not always.  I realised that there was a
bi-modal distribution of execution times.  It looks like my 'early
exit' protocol, designed to make tuple-queue deadlock impossible, is
often causing us to lose a worker.  I am working on that now.

I have code changes for Peter G's and Andres's feedback queued up and
will send a v8 series shortly, hopefully with a fix for problem 2
above.

-- 
Thomas Munro
http://www.enterprisedb.com


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] WIP: [[Parallel] Shared] Hash

2017-03-13 Thread Thomas Munro
On Mon, Mar 13, 2017 at 8:40 PM, Rafia Sabih
 wrote:
> In an attempt to test v7 of this patch on TPC-H 20 scale factor I found a
> few regressions,
> Q21: 52 secs on HEAD and  400 secs with this patch

Thanks Rafia.  Robert just pointed out off-list that there is a bogus
0 row estimate in here:

->  Parallel Hash Semi Join  (cost=1006599.34..1719227.30 rows=0
width=24) (actual time=38716.488..100933.250 rows=7315896 loops=5)

Will investigate, thanks.

> Q8: 8 secs on HEAD to 14 secs with patch

Also looking into this one.

-- 
Thomas Munro
http://www.enterprisedb.com


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] WIP: [[Parallel] Shared] Hash

2017-03-13 Thread Rafia Sabih
On Thu, Mar 9, 2017 at 5:28 PM, Thomas Munro 
wrote:

> On Wed, Mar 8, 2017 at 12:58 PM, Andres Freund  wrote:
> > 0002: Check hash join work_mem usage at the point of chunk allocation.
> >
> > Modify the existing hash join code to detect work_mem exhaustion at
> > the point where chunks are allocated, instead of checking after every
> > tuple insertion.  This matches the logic used for estimating, and more
> > importantly allows for some parallelism in later patches.
> >
> > diff --git a/src/backend/executor/nodeHash.c b/src/backend/executor/
> nodeHash.c
> > index 406c180..af1b66d 100644
> > --- a/src/backend/executor/nodeHash.c
> > +++ b/src/backend/executor/nodeHash.c
> > @@ -48,7 +48,8 @@ static void ExecHashSkewTableInsert(HashJoinTable
> hashtable,
> > int bucketNumber);
> >  static void ExecHashRemoveNextSkewBucket(HashJoinTable hashtable);
> >
> > -static void *dense_alloc(HashJoinTable hashtable, Size size);
> > +static void *dense_alloc(HashJoinTable hashtable, Size size,
> > +bool respect_work_mem);
> >
> > I still dislike this, but maybe Robert's point of:
> >
> > On 2017-02-16 08:57:21 -0500, Robert Haas wrote:
> >> On Wed, Feb 15, 2017 at 9:36 PM, Andres Freund 
> wrote:
> >> > Isn't it kinda weird to do this from within dense_alloc()?  I mean
> that
> >> > dumps a lot of data to disk, frees a bunch of memory and so on - not
> >> > exactly what "dense_alloc" implies.  Isn't the free()ing part also
> >> > dangerous, because the caller might actually use some of that memory,
> >> > like e.g. in ExecHashRemoveNextSkewBucket() or such.  I haven't looked
> >> > deeply enough to check whether that's an active bug, but it seems like
> >> > inviting one if not.
> >>
> >> I haven't looked at this, but one idea might be to just rename
> >> dense_alloc() to ExecHashBlahBlahSomething().  If there's a real
> >> abstraction layer problem here then we should definitely fix it, but
> >> maybe it's just the angle at which you hold your head.
> >
> > Is enough.
>
> There is a problem here.  It can determine that it needs to increase
> the number of batches, effectively splitting the current batch, but
> then the caller goes on to insert the current tuple anyway, even
> though it may no longer belong in this batch.  I will post a fix for
> that soon.  I will also refactor it so that it doesn't do that work
> inside dense_alloc.  You're right, that's too weird.
>
> In the meantime, here is a new patch series addressing the other
> things you raised.
>
> > 0003: Scan for unmatched tuples in a hash join one chunk at a time.
> >
> >
> > @@ -1152,8 +1155,65 @@ bool
> >  ExecScanHashTableForUnmatched(HashJoinState *hjstate, ExprContext
> *econtext)
> >  {
> > HashJoinTable hashtable = hjstate->hj_HashTable;
> > -   HashJoinTuple hashTuple = hjstate->hj_CurTuple;
> > +   HashJoinTuple hashTuple;
> > +   MinimalTuple tuple;
> > +
> > +   /*
> > +* First, process the queue of chunks holding tuples that are in
> regular
> > +* (non-skew) buckets.
> > +*/
> > +   for (;;)
> > +   {
> > +   /* Do we need a new chunk to scan? */
> > +   if (hashtable->current_chunk == NULL)
> > +   {
> > +   /* Have we run out of chunks to scan? */
> > +   if (hashtable->unmatched_chunks == NULL)
> > +   break;
> > +
> > +   /* Pop the next chunk from the front of the
> queue. */
> > +   hashtable->current_chunk =
> hashtable->unmatched_chunks;
> > +   hashtable->unmatched_chunks =
> hashtable->current_chunk->next;
> > +   hashtable->current_chunk_index = 0;
> > +   }
> > +
> > +   /* Have we reached the end of this chunk yet? */
> > +   if (hashtable->current_chunk_index >=
> hashtable->current_chunk->used)
> > +   {
> > +   /* Go around again to get the next chunk from
> the queue. */
> > +   hashtable->current_chunk = NULL;
> > +   continue;
> > +   }
> > +
> > +   /* Take the next tuple from this chunk. */
> > +   hashTuple = (HashJoinTuple)
> > +   (hashtable->current_chunk->data +
> hashtable->current_chunk_index);
> > +   tuple = HJTUPLE_MINTUPLE(hashTuple);
> > +   hashtable->current_chunk_index +=
> > +   MAXALIGN(HJTUPLE_OVERHEAD + tuple->t_len);
> > +
> > +   /* Is it unmatched? */
> > +   if (!HeapTupleHeaderHasMatch(tuple))
> > +   {
> > +   TupleTableSlot *inntuple;
> > +
> > +   /* insert hashtable's tuple into exec slot */
> > +  

Re: [HACKERS] WIP: [[Parallel] Shared] Hash

2017-03-09 Thread Peter Geoghegan
On Thu, Mar 9, 2017 at 3:58 AM, Thomas Munro
 wrote:
> In the meantime, here is a new patch series addressing the other
> things you raised.

Please see my remarks on 0007-hj-shared-buf-file-v7.patch over on the
"on_dsm_detach() callback and parallel tuplesort BufFile resource
management" thread. They still apply to this latest version of the
patch series.


-- 
Peter Geoghegan


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] WIP: [[Parallel] Shared] Hash

2017-03-09 Thread Thomas Munro
On Wed, Mar 8, 2017 at 12:58 PM, Andres Freund  wrote:
> 0002: Check hash join work_mem usage at the point of chunk allocation.
>
> Modify the existing hash join code to detect work_mem exhaustion at
> the point where chunks are allocated, instead of checking after every
> tuple insertion.  This matches the logic used for estimating, and more
> importantly allows for some parallelism in later patches.
>
> diff --git a/src/backend/executor/nodeHash.c b/src/backend/executor/nodeHash.c
> index 406c180..af1b66d 100644
> --- a/src/backend/executor/nodeHash.c
> +++ b/src/backend/executor/nodeHash.c
> @@ -48,7 +48,8 @@ static void ExecHashSkewTableInsert(HashJoinTable hashtable,
> int bucketNumber);
>  static void ExecHashRemoveNextSkewBucket(HashJoinTable hashtable);
>
> -static void *dense_alloc(HashJoinTable hashtable, Size size);
> +static void *dense_alloc(HashJoinTable hashtable, Size size,
> +bool respect_work_mem);
>
> I still dislike this, but maybe Robert's point of:
>
> On 2017-02-16 08:57:21 -0500, Robert Haas wrote:
>> On Wed, Feb 15, 2017 at 9:36 PM, Andres Freund  wrote:
>> > Isn't it kinda weird to do this from within dense_alloc()?  I mean that
>> > dumps a lot of data to disk, frees a bunch of memory and so on - not
>> > exactly what "dense_alloc" implies.  Isn't the free()ing part also
>> > dangerous, because the caller might actually use some of that memory,
>> > like e.g. in ExecHashRemoveNextSkewBucket() or such.  I haven't looked
>> > deeply enough to check whether that's an active bug, but it seems like
>> > inviting one if not.
>>
>> I haven't looked at this, but one idea might be to just rename
>> dense_alloc() to ExecHashBlahBlahSomething().  If there's a real
>> abstraction layer problem here then we should definitely fix it, but
>> maybe it's just the angle at which you hold your head.
>
> Is enough.

There is a problem here.  It can determine that it needs to increase
the number of batches, effectively splitting the current batch, but
then the caller goes on to insert the current tuple anyway, even
though it may no longer belong in this batch.  I will post a fix for
that soon.  I will also refactor it so that it doesn't do that work
inside dense_alloc.  You're right, that's too weird.

In the meantime, here is a new patch series addressing the other
things you raised.

> 0003: Scan for unmatched tuples in a hash join one chunk at a time.
>
>
> @@ -1152,8 +1155,65 @@ bool
>  ExecScanHashTableForUnmatched(HashJoinState *hjstate, ExprContext *econtext)
>  {
> HashJoinTable hashtable = hjstate->hj_HashTable;
> -   HashJoinTuple hashTuple = hjstate->hj_CurTuple;
> +   HashJoinTuple hashTuple;
> +   MinimalTuple tuple;
> +
> +   /*
> +* First, process the queue of chunks holding tuples that are in 
> regular
> +* (non-skew) buckets.
> +*/
> +   for (;;)
> +   {
> +   /* Do we need a new chunk to scan? */
> +   if (hashtable->current_chunk == NULL)
> +   {
> +   /* Have we run out of chunks to scan? */
> +   if (hashtable->unmatched_chunks == NULL)
> +   break;
> +
> +   /* Pop the next chunk from the front of the queue. */
> +   hashtable->current_chunk = 
> hashtable->unmatched_chunks;
> +   hashtable->unmatched_chunks = 
> hashtable->current_chunk->next;
> +   hashtable->current_chunk_index = 0;
> +   }
> +
> +   /* Have we reached the end of this chunk yet? */
> +   if (hashtable->current_chunk_index >= 
> hashtable->current_chunk->used)
> +   {
> +   /* Go around again to get the next chunk from the 
> queue. */
> +   hashtable->current_chunk = NULL;
> +   continue;
> +   }
> +
> +   /* Take the next tuple from this chunk. */
> +   hashTuple = (HashJoinTuple)
> +   (hashtable->current_chunk->data + 
> hashtable->current_chunk_index);
> +   tuple = HJTUPLE_MINTUPLE(hashTuple);
> +   hashtable->current_chunk_index +=
> +   MAXALIGN(HJTUPLE_OVERHEAD + tuple->t_len);
> +
> +   /* Is it unmatched? */
> +   if (!HeapTupleHeaderHasMatch(tuple))
> +   {
> +   TupleTableSlot *inntuple;
> +
> +   /* insert hashtable's tuple into exec slot */
> +   inntuple = ExecStoreMinimalTuple(tuple,
> + 
>hjstate->hj_HashTupleSlot,
> + 
>false); /* do not pfree */
> + 

Re: [HACKERS] WIP: [[Parallel] Shared] Hash

2017-03-08 Thread Thomas Munro
On Wed, Mar 8, 2017 at 1:15 PM, Tom Lane  wrote:
> Andres Freund  writes:
>> +++ b/src/include/storage/barrier.h
>> +#include "postgres.h"
>
>> Huh, that normally shouldn't be in a header.  I see you introduced that
>> in a bunch of other places too - that really doesn't look right to me.
>
> That is absolutely not project style and is not acceptable.
>
> The core reason why not is that postgres.h/postgres_fe.h/c.h have to be
> the *first* inclusion in every compilation, for arcane portability reasons
> you really don't want to know about.  (Suffice it to say that on some
> platforms, stdio.h isn't all that std.)  Our coding rule for that is that
> we put the appropriate one of these first in every .c file, while .h files
> always assume that it's been included already.  As soon as you break that
> convention, it becomes unclear from looking at a .c file whether the
> ordering requirement has been satisfied.  Also, since now you've moved
> the must-be-first requirement to some other header file(s), you risk
> breakage when somebody applies another project convention about
> alphabetizing #include references for all headers other than those magic
> ones.

Thanks for the explanation.  Will post a new series addressing this
and other complaints from Andres shortly.

-- 
Thomas Munro
http://www.enterprisedb.com


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] WIP: [[Parallel] Shared] Hash

2017-03-07 Thread Tom Lane
Andres Freund  writes:
> +++ b/src/include/storage/barrier.h
> +#include "postgres.h"

> Huh, that normally shouldn't be in a header.  I see you introduced that
> in a bunch of other places too - that really doesn't look right to me.

That is absolutely not project style and is not acceptable.

The core reason why not is that postgres.h/postgres_fe.h/c.h have to be
the *first* inclusion in every compilation, for arcane portability reasons
you really don't want to know about.  (Suffice it to say that on some
platforms, stdio.h isn't all that std.)  Our coding rule for that is that
we put the appropriate one of these first in every .c file, while .h files
always assume that it's been included already.  As soon as you break that
convention, it becomes unclear from looking at a .c file whether the
ordering requirement has been satisfied.  Also, since now you've moved
the must-be-first requirement to some other header file(s), you risk
breakage when somebody applies another project convention about
alphabetizing #include references for all headers other than those magic
ones.

In short, don't even think of doing this.

regards, tom lane


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] WIP: [[Parallel] Shared] Hash

2017-03-07 Thread Andres Freund
Hi,

0001: Do hash join work_mem accounting in chunks.

Don't think there's much left to say.

0002: Check hash join work_mem usage at the point of chunk allocation.

Modify the existing hash join code to detect work_mem exhaustion at
the point where chunks are allocated, instead of checking after every
tuple insertion.  This matches the logic used for estimating, and more
importantly allows for some parallelism in later patches.

diff --git a/src/backend/executor/nodeHash.c b/src/backend/executor/nodeHash.c
index 406c180..af1b66d 100644
--- a/src/backend/executor/nodeHash.c
+++ b/src/backend/executor/nodeHash.c
@@ -48,7 +48,8 @@ static void ExecHashSkewTableInsert(HashJoinTable hashtable,
int bucketNumber);
 static void ExecHashRemoveNextSkewBucket(HashJoinTable hashtable);

-static void *dense_alloc(HashJoinTable hashtable, Size size);
+static void *dense_alloc(HashJoinTable hashtable, Size size,
+bool respect_work_mem);

I still dislike this, but maybe Robert's point of:

On 2017-02-16 08:57:21 -0500, Robert Haas wrote:
> On Wed, Feb 15, 2017 at 9:36 PM, Andres Freund  wrote:
> > Isn't it kinda weird to do this from within dense_alloc()?  I mean that
> > dumps a lot of data to disk, frees a bunch of memory and so on - not
> > exactly what "dense_alloc" implies.  Isn't the free()ing part also
> > dangerous, because the caller might actually use some of that memory,
> > like e.g. in ExecHashRemoveNextSkewBucket() or such.  I haven't looked
> > deeply enough to check whether that's an active bug, but it seems like
> > inviting one if not.
>
> I haven't looked at this, but one idea might be to just rename
> dense_alloc() to ExecHashBlahBlahSomething().  If there's a real
> abstraction layer problem here then we should definitely fix it, but
> maybe it's just the angle at which you hold your head.

Is enough.


0003: Scan for unmatched tuples in a hash join one chunk at a time.


@@ -1152,8 +1155,65 @@ bool
 ExecScanHashTableForUnmatched(HashJoinState *hjstate, ExprContext *econtext)
 {
HashJoinTable hashtable = hjstate->hj_HashTable;
-   HashJoinTuple hashTuple = hjstate->hj_CurTuple;
+   HashJoinTuple hashTuple;
+   MinimalTuple tuple;
+
+   /*
+* First, process the queue of chunks holding tuples that are in regular
+* (non-skew) buckets.
+*/
+   for (;;)
+   {
+   /* Do we need a new chunk to scan? */
+   if (hashtable->current_chunk == NULL)
+   {
+   /* Have we run out of chunks to scan? */
+   if (hashtable->unmatched_chunks == NULL)
+   break;
+
+   /* Pop the next chunk from the front of the queue. */
+   hashtable->current_chunk = hashtable->unmatched_chunks;
+   hashtable->unmatched_chunks = 
hashtable->current_chunk->next;
+   hashtable->current_chunk_index = 0;
+   }
+
+   /* Have we reached the end of this chunk yet? */
+   if (hashtable->current_chunk_index >= 
hashtable->current_chunk->used)
+   {
+   /* Go around again to get the next chunk from the 
queue. */
+   hashtable->current_chunk = NULL;
+   continue;
+   }
+
+   /* Take the next tuple from this chunk. */
+   hashTuple = (HashJoinTuple)
+   (hashtable->current_chunk->data + 
hashtable->current_chunk_index);
+   tuple = HJTUPLE_MINTUPLE(hashTuple);
+   hashtable->current_chunk_index +=
+   MAXALIGN(HJTUPLE_OVERHEAD + tuple->t_len);
+
+   /* Is it unmatched? */
+   if (!HeapTupleHeaderHasMatch(tuple))
+   {
+   TupleTableSlot *inntuple;
+
+   /* insert hashtable's tuple into exec slot */
+   inntuple = ExecStoreMinimalTuple(tuple,
+   
 hjstate->hj_HashTupleSlot,
+   
 false); /* do not pfree */
+   econtext->ecxt_innertuple = inntuple;
+
+   /* reset context each time (see below for explanation) 
*/
+   ResetExprContext(econtext);
+   return true;
+   }
+   }

I suspect this might actually be slower than the current/old logic,
because the current_chunk tests are repeated every loop. I think
retaining the two loops the previous code had makes sense, i.e. one to
find a relevant chunk, and one to iterate through all tuples in a chunk,
checking for an unmatched one.


Have you run a performance comparison pre/post this patch?  I 

Re: [HACKERS] WIP: [[Parallel] Shared] Hash

2017-03-07 Thread Andres Freund
Hi,

On 2017-03-07 02:57:30 +1300, Thomas Munro wrote:
> I'm not sure why nodeHashjoin.c is doing raw batchfile read/write
> operations anyway; why not use tuplestore.c for that (as
> tuplestore.c's comments incorrectly say is the case)?

Another reason presumably is that using tuplestores would make it harder
to control the amount of memory used - we do *not* want an extra set of
work_mem used here, right?

- Andres


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] WIP: [[Parallel] Shared] Hash

2017-03-06 Thread Thomas Munro
On Wed, Mar 1, 2017 at 10:40 PM, Thomas Munro
 wrote:
> I'm testing a new version which incorporates feedback from Andres and
> Ashutosh, and is refactored to use a new SharedBufFileSet component to
> handle batch files, replacing the straw-man implementation from the v5
> patch series.  I've set this to waiting-on-author and will post v6
> tomorrow.

I created a system for reference counted partitioned temporary files
called SharedBufFileSet: see 0007-hj-shared-buf-file.patch.  Then I
ripped out the code for sharing batch files that I previously had
cluttering up nodeHashjoin.c, and refactored it into a new component
called a SharedTuplestore which wraps a SharedBufFileSet and gives it
a tuple-based interface: see 0008-hj-shared-tuplestore.patch.  The
name implies aspirations of becoming a more generally useful shared
analogue of tuplestore, but for now it supports only the exact access
pattern needed for hash join batches ($10 wrench).

It creates temporary files like this:

  base/pgsql_tmp/pgsql_tmp[pid].[set].[partition].[participant].[segment]

I'm not sure why nodeHashjoin.c is doing raw batchfile read/write
operations anyway; why not use tuplestore.c for that (as
tuplestore.c's comments incorrectly say is the case)?  Maybe because
Tuplestore's interface doesn't support storing the extra hash value.
In SharedTuplestore I solved that problem by introducing an optional
fixed sized piece of per-tuple meta-data.  Another thing that is
different about SharedTuplestore is that it supports partitions, which
is convenient for this project and probably other parallel projects
too.

In order for workers to be able to participate in reference counting
schemes based on DSM segment lifetime, I had to give the
Exec*InitializeWorker() functions access to the dsm_segment object,
whereas previously they received only the shm_toc in order to access
its contents.  I invented ParallelWorkerContext which has just two
members 'seg' and 'toc': see
0005-hj-let-node-have-seg-in-worker.patch.  I didn't touch the FDW API
or custom scan API where they currently take toc, though I can see
that there is an argument that they should; changing those APIs seems
like a bigger deal.  Another approach would be to use ParallelContext,
as passed into ExecXXXInitializeDSM, with the members that are not
applicable to workers zeroed out.  Thoughts?

I got rid of the ExecDetachXXX stuff I had invented in the last
version, because acf555bc fixed the problem a better way.

I found that I needed to put use more than one toc entry for a single
executor node, in order to reserve space for the inner and outer
SharedTuplestore objects.  So I invented a way to make more extra keys
with PARALLEL_KEY_EXECUTOR_NTH(plan_node_id, N).

-- 
Thomas Munro
http://www.enterprisedb.com


parallel-shared-hash-v6.tgz
Description: GNU Zip compressed data

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] WIP: [[Parallel] Shared] Hash

2017-03-01 Thread Thomas Munro
On Thu, Feb 16, 2017 at 9:08 PM, Thomas Munro
 wrote:
> On Thu, Feb 16, 2017 at 3:36 PM, Andres Freund  wrote:
>> That's it for now...
>
> Thanks!  Plenty for me to go away and think about.  I will post a new
> version soon.

I'm testing a new version which incorporates feedback from Andres and
Ashutosh, and is refactored to use a new SharedBufFileSet component to
handle batch files, replacing the straw-man implementation from the v5
patch series.  I've set this to waiting-on-author and will post v6
tomorrow.

-- 
Thomas Munro
http://www.enterprisedb.com


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] WIP: [[Parallel] Shared] Hash

2017-02-16 Thread Robert Haas
On Wed, Feb 15, 2017 at 9:36 PM, Andres Freund  wrote:
> 0002-hj-add-dtrace-probes-v5.patch
>
> Hm. I'm personally very unenthusiastic about addming more of these, and
> would rather rip all of them out.  I tend to believe that static
> problems simply aren't a good approach for anything requiring a lot of
> detail.  But whatever.

I'm not a big fan of either static problems or static probes, myself.

> Isn't it kinda weird to do this from within dense_alloc()?  I mean that
> dumps a lot of data to disk, frees a bunch of memory and so on - not
> exactly what "dense_alloc" implies.  Isn't the free()ing part also
> dangerous, because the caller might actually use some of that memory,
> like e.g. in ExecHashRemoveNextSkewBucket() or such.  I haven't looked
> deeply enough to check whether that's an active bug, but it seems like
> inviting one if not.

I haven't looked at this, but one idea might be to just rename
dense_alloc() to ExecHashBlahBlahSomething().  If there's a real
abstraction layer problem here then we should definitely fix it, but
maybe it's just the angle at which you hold your head.

> To me that is a weakness in the ExecShutdownNode() API - imo child nodes
> should get the chance to shutdown before the upper-level node.
> ExecInitNode/ExecEndNode etc give individual nodes the freedom to do
> things in the right order, but ExecShutdownNode() doesn't.  I don't
> quite see why we'd want to invent a separate ExecDetachNode() that'd be
> called immediately before ExecShutdownNode().

Interestingly, the same point came up on the Parallel Bitmap Heap Scan thread.

> An easy way to change that would be to return in the
> ExecShutdownNode()'s T_GatherState case, and delegate the responsibility
> of calling it on Gather's children to ExecShutdownGather().
> Alternatively we could make it a full-blown thing like ExecInitNode()
> that every node needs to implement, but that seems a bit painful.

I was thinking we should just switch things so that ExecShutdownNode()
recurses first, and then does the current node.  There's no real
excuse for a node terminating the shutdown scan early, I think.

> Or have I missed something here?
>
> Random aside: Wondered before if having to provide all executor
> callbacks is a weakness of our executor integration, and whether it
> shouldn't be a struct of callbacks instead...

I honestly have no idea whether that would be better or worse from the
CPU's point of view.

> I think it's a bit weird that the parallel path now has an exit path
> that the non-parallel path doesn't have.

I'm not sure about this particular one, but in general those are
pretty common.  For example, look at the changes
569174f1be92be93f5366212cc46960d28a5c5cd made to _bt_first().  When
you get there, you can discover that you aren't actually the first,
and that in fact all the work is already complete, and there's nothing
left for you to do but give up.

> Hm. That seems a bit on the detailed side - if we're going that way it
> seems likely that we'll end up with hundreds of wait events. I don't
> think gradually evolving wait events into something like a query
> progress framework is a good idea.

I'm pretty strongly of the opinion that we should not reuse multiple
wait events for the same purpose.  The whole point of the wait event
system is to identify what caused the wait.  Having relatively
recently done a ton of work to separate all of the waits in the system
and identify them individually, I'm loathe to see us start melding
things back together again.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] WIP: [[Parallel] Shared] Hash

2017-02-16 Thread Thomas Munro
On Thu, Feb 16, 2017 at 3:36 PM, Andres Freund  wrote:
> Hi,

Thanks for the review!

> FWIW, I'd appreciate if you'd added a short commit message to the
> individual patches - I find it helpful to have a littlebit more context
> while looking at them than just the titles.  Alternatively you can
> include that text when re-posting the series, but it's imo just as easy
> to have a short commit message (and just use format-patch).
>
> I'm for now using [1] as context.

Ok, will do.

> 0002-hj-add-dtrace-probes-v5.patch
>
> Hm. I'm personally very unenthusiastic about addming more of these, and
> would rather rip all of them out.  I tend to believe that static
> problems simply aren't a good approach for anything requiring a lot of
> detail.  But whatever.

Ok, I will get rid of these.  Apparently you aren't the only committer
who hates these.  (I have some other thoughts on that but will save
them for another time.)

> 0003-hj-refactor-memory-accounting-v5.patch
> @@ -424,15 +422,29 @@ ExecChooseHashTableSize(double ntuples, int tupwidth, 
> bool useskew,
> if (ntuples <= 0.0)
> ntuples = 1000.0;
>
> -   /*
> -* Estimate tupsize based on footprint of tuple in hashtable... note 
> this
> -* does not allow for any palloc overhead.  The manipulations of 
> spaceUsed
> -* don't count palloc overhead either.
> -*/
> +   /* Estimate tupsize based on footprint of tuple in hashtable. */
>
> palloc overhead is still unaccounted for, no? In the chunked case that
> might not be much, I realize that (so that comment should probably have
> been updated when chunking was introduced).

Yeah, it seemed like an obsolete comment, but I'll put it back as that
isn't relevant to this patch.

> -   SizespaceUsed;  /* memory space currently 
> used by tuples */
> +   SizespaceUsed;  /* memory space currently 
> used by hashtable */
>
> It's not really the full hashtable, is it? The ->buckets array appears
> to still be unaccounted for.

It is actually the full hash table, and that is a change in this
patch.  See ExecHashTableCreate and ExecHashTableReset where is it set
to nbuckets * sizeof(HashJoinTuple), so that at all times it holds the
total size of buckets + all chunks.  Unlike the code in master, where
it's just the sum of all tuples while building, but then the bucket
space is added at the end in MultiExecHash.

> Looks ok.

Thanks!

> 0004-hj-refactor-batch-increases-v5.patch
>
> @@ -1693,10 +1689,12 @@ ExecHashRemoveNextSkewBucket(HashJoinTable hashtable)
>  }
>
>  /*
> - * Allocate 'size' bytes from the currently active HashMemoryChunk
> + * Allocate 'size' bytes from the currently active HashMemoryChunk.  If
> + * 'respect_work_mem' is true, this may cause the number of batches to be
> + * increased in an attempt to shrink the hash table.
>   */
>  static void *
> -dense_alloc(HashJoinTable hashtable, Size size)
> +dense_alloc(HashJoinTable hashtable, Size size, bool respect_work_mem)
>
> {
> HashMemoryChunk newChunk;
> char   *ptr;
> @@ -1710,6 +1708,15 @@ dense_alloc(HashJoinTable hashtable, Size size)
>  */
> if (size > HASH_CHUNK_THRESHOLD)
> {
> +   if (respect_work_mem &&
> +   hashtable->growEnabled &&
> +   hashtable->spaceUsed + HASH_CHUNK_HEADER_SIZE + size >
> +   hashtable->spaceAllowed)
> +   {
> +   /* work_mem would be exceeded: try to shrink hash 
> table */
> +   ExecHashIncreaseNumBatches(hashtable);
> +   }
> +
>
> Isn't it kinda weird to do this from within dense_alloc()?  I mean that
> dumps a lot of data to disk, frees a bunch of memory and so on - not
> exactly what "dense_alloc" implies.

Hmm.  Yeah I guess that is a bit weird.  My problem was that in the
shared case (later patch), when you call this function it has a fast
path and a slow path: the fast path just hands out more space from the
existing chunk, but the slow path acquires an LWLock and makes space
management decisions which have to be done sort of "atomically".  In
an earlier version I toyed with the idea of making dense_alloc return
NULL if you said respect_work_mem = true and it determined that you
need to increase the number of batches or go help other workers who
have already started doing so.  Then the batch increase stuff was not
in here, but callers who say respect_work_mem = true (the build and
reload loops) had to be prepared to loop and shrink if they get NULL,
or some wrapper function needs to do that them.  I will try it that
way again.

>  Isn't the free()ing part also
> dangerous, because the caller might actually use some of that memory,
> like e.g. in ExecHashRemoveNextSkewBucket() or such.  I haven't looked
> deeply enough to check whether that's an active bug, but it seems like
> inviting one if not.


Re: [HACKERS] WIP: [[Parallel] Shared] Hash

2017-02-15 Thread Andres Freund
Hi,

Just to summarize what you could read between the lines in the previous
mail: From a higher level POV the design here makes sense to me, I do
however think there's a good chunk of code-level improvements needed.

Regards,

Andres


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] WIP: [[Parallel] Shared] Hash

2017-02-15 Thread Andres Freund
Hi,

On 2017-02-13 23:57:00 +1300, Thomas Munro wrote:
> Here's a new version to fix the problems reported by Rafia above.  The
> patch descriptions are as before but it starts from 0002 because 0001
> was committed as 7c5d8c16 (thanks, Andres).

FWIW, I'd appreciate if you'd added a short commit message to the
individual patches - I find it helpful to have a littlebit more context
while looking at them than just the titles.  Alternatively you can
include that text when re-posting the series, but it's imo just as easy
to have a short commit message (and just use format-patch).

I'm for now using [1] as context.


0002-hj-add-dtrace-probes-v5.patch

Hm. I'm personally very unenthusiastic about addming more of these, and
would rather rip all of them out.  I tend to believe that static
problems simply aren't a good approach for anything requiring a lot of
detail.  But whatever.


0003-hj-refactor-memory-accounting-v5.patch
@@ -424,15 +422,29 @@ ExecChooseHashTableSize(double ntuples, int tupwidth, 
bool useskew,
if (ntuples <= 0.0)
ntuples = 1000.0;
 
-   /*
-* Estimate tupsize based on footprint of tuple in hashtable... note 
this
-* does not allow for any palloc overhead.  The manipulations of 
spaceUsed
-* don't count palloc overhead either.
-*/
+   /* Estimate tupsize based on footprint of tuple in hashtable. */

palloc overhead is still unaccounted for, no? In the chunked case that
might not be much, I realize that (so that comment should probably have
been updated when chunking was introduced).

-   SizespaceUsed;  /* memory space currently used 
by tuples */
+   SizespaceUsed;  /* memory space currently used 
by hashtable */

It's not really the full hashtable, is it? The ->buckets array appears
to still be unaccounted for.

Looks ok.


0004-hj-refactor-batch-increases-v5.patch

@@ -1693,10 +1689,12 @@ ExecHashRemoveNextSkewBucket(HashJoinTable hashtable)
 }
 
 /*
- * Allocate 'size' bytes from the currently active HashMemoryChunk
+ * Allocate 'size' bytes from the currently active HashMemoryChunk.  If
+ * 'respect_work_mem' is true, this may cause the number of batches to be
+ * increased in an attempt to shrink the hash table.
  */
 static void *
-dense_alloc(HashJoinTable hashtable, Size size)
+dense_alloc(HashJoinTable hashtable, Size size, bool respect_work_mem)

{
HashMemoryChunk newChunk;
char   *ptr;
@@ -1710,6 +1708,15 @@ dense_alloc(HashJoinTable hashtable, Size size)
 */
if (size > HASH_CHUNK_THRESHOLD)
{
+   if (respect_work_mem &&
+   hashtable->growEnabled &&
+   hashtable->spaceUsed + HASH_CHUNK_HEADER_SIZE + size >
+   hashtable->spaceAllowed)
+   {
+   /* work_mem would be exceeded: try to shrink hash table 
*/
+   ExecHashIncreaseNumBatches(hashtable);
+   }
+

Isn't it kinda weird to do this from within dense_alloc()?  I mean that
dumps a lot of data to disk, frees a bunch of memory and so on - not
exactly what "dense_alloc" implies.  Isn't the free()ing part also
dangerous, because the caller might actually use some of that memory,
like e.g. in ExecHashRemoveNextSkewBucket() or such.  I haven't looked
deeply enough to check whether that's an active bug, but it seems like
inviting one if not.


0005-hj-refactor-unmatched-v5.patch

I'm a bit confused as to why unmatched tuple scan is a good parallelism
target, but I might see later...

0006-hj-barrier-v5.patch

Skipping that here.


0007-hj-exec-detach-node-v5.patch

Hm. You write elsewhere:
> By the time ExecEndNode() runs in workers, ExecShutdownNode() has
> already run.  That's done on purpose because, for example, the hash
> table needs to survive longer than the parallel environment to allow
> EXPLAIN to peek at it.  But it means that the Gather node has thrown
> out the shared memory before any parallel-aware node below it gets to
> run its Shutdown and End methods.  So I invented ExecDetachNode()
> which runs before ExecShutdownNode(), giving parallel-aware nodes a
> chance to say goodbye before their shared memory vanishes.  Better
> ideas?

To me that is a weakness in the ExecShutdownNode() API - imo child nodes
should get the chance to shutdown before the upper-level node.
ExecInitNode/ExecEndNode etc give individual nodes the freedom to do
things in the right order, but ExecShutdownNode() doesn't.  I don't
quite see why we'd want to invent a separate ExecDetachNode() that'd be
called immediately before ExecShutdownNode().

An easy way to change that would be to return in the
ExecShutdownNode()'s T_GatherState case, and delegate the responsibility
of calling it on Gather's children to ExecShutdownGather().
Alternatively we could make it a full-blown thing like ExecInitNode()
that every node needs to implement, but that seems a 

Re: [HACKERS] WIP: [[Parallel] Shared] Hash

2017-02-15 Thread Thomas Munro
Out of archeological curiosity, I was digging around in the hash join
code and RCS history from Postgres 4.2[1], and I was astounded to
discover that it had a parallel executor for Sequent SMP systems and
was capable of parallel hash joins as of 1991.  At first glance, it
seems to follow approximately the same design as I propose: share a
hash table and use a barrier to coordinate the switch from build phase
to probe phase and deal with later patches.  It uses mmap to get space
and then works with relative pointers.  See
src/backend/executor/n_hash.c and src/backend/executor/n_hashjoin.c.
Some of this might be described in Wei Hong's PhD thesis[2] which I
haven't had the pleasure of reading yet.

The parallel support is absent from the first commit in our repo
(1996), but there are some vestiges like RelativeAddr and ABSADDR used
to access the hash table (presumably needlessly) and also some
mentions of parallel machines in comments that survived up until
commit 26069a58 (1999).

[1] http://db.cs.berkeley.edu/postgres.html
[2] http://db.cs.berkeley.edu/papers/ERL-M93-28.pdf

-- 
Thomas Munro
http://www.enterprisedb.com


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] WIP: [[Parallel] Shared] Hash

2017-02-13 Thread Thomas Munro
On Thu, Feb 9, 2017 at 2:03 AM, Ashutosh Bapat
 wrote:
>>
>> 0004-hj-refactor-batch-increases-v4.patch:
>>
>> Modify the existing hash join code to detect work_mem exhaustion at
>> the point where chunks are allocated, instead of checking after every
>> tuple insertion.  This matches the logic used for estimating, and more
>> importantly allows for some parallelism in later patches.
>
> The patch has three changes
> 1. change dense_alloc() to accept respect_workmem argument and use it
> within the function.
> 2. Move call to ExecHashIncreaseNumBatches() into dense_alloc() from
> ExecHashTableInsert() to account for memory before inserting new tuple
> 3. Check growEnabled before calling ExecHashIncreaseNumBatches().

Thanks for the review!

> I think checking growEnabled within ExecHashIncreaseNumBatches() is
> more easy to maintain that checking at every caller. If someone is to
> add a caller tomorrow, s/he has to remember to add the check.

Hmm.  Yeah.  In the later 0010 patch ExecHashIncreaseNumBatches will
be used in a slightly different way -- not for making decisions or
performing the hash table shrink, but only for reallocating the batch
arrays.  I will see if putting the growEnabled check back in there in
the 0004 patch and then refactoring in a later patch makes more sense
to someone reviewing the patches independently, for the next version.

> It might be better to add some comments in
> ExecHashRemoveNextSkewBucket() explaining why dense_alloc() should be
> called with respect_work_mem = false? ExecHashSkewTableInsert() does
> call ExecHashIncreaseNumBatches() after calling
> ExecHashRemoveNextSkewBucket() multiple times, so it looks like we do
> expect increase in space used and thus go beyond work_mem for a short
> while. Is there a way we can handle this case in dense_alloc()?

Right, that needs some explanation, which I'll add for the next
version.  The explanation is that while 'shrinking' the hash table, we
may need to go over the work_mem limit by one chunk for a short time.
That is already true in master, but by moving the work_mem checks into
dense_alloc I ran into the problem that dense_alloc might decide to
shrink the hash table which needs to call dense alloc.  Shrinking
works by spinning through all the chunks copying only the tuples we
want to keep into new chunks and freeing the old chunks as we go.  We
will temporarily go one chunk over work_mem when we allocate the first
new chunk but before we've freed the first old one.  We don't want
shrink operations to trigger recursive shrink operations, so we
disable respect for work_mem when calling it from
ExecHashIncreaseNumBatches.  In the course of regular hash table
loading, we want to respect work_mem.

Looking at the v5 patch series I posted yesterday, I see that in fact
ExecHashIncreaseNumBatches calls dense_alloc with respect_work_mem =
true in the 0004 patch, and then I corrected that mistake in the 0008
patch; I'll move the correction back to the 0004 patch in the next
version.

To reach ExecHashIncreaseNumBatches, see the "ugly" query in
hj-test-queries.sql (posted with v5).

In ExecHashRemoveNextSkewBucket I preserved the existing behaviour of
not caring about work_mem when performing the rare operation of
copying a tuple from the skew bucket into a dense_alloc memory chunk
so it can be inserted into a regular (non-skew) bucket.

> Is it possible that increasing the number of batches changes the
> bucket number of the tuple being inserted? If so, should we
> recalculate the bucket and batch of the tuple being inserted?

No -- see the function documentation for ExecHashGetBucketAndBatch.

-- 
Thomas Munro
http://www.enterprisedb.com


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] WIP: [[Parallel] Shared] Hash

2017-02-13 Thread Thomas Munro
On Thu, Feb 2, 2017 at 4:57 PM, Rafia Sabih
 wrote:
> On Thu, Feb 2, 2017 at 1:19 AM, Thomas Munro
>  wrote:
>> On Thu, Feb 2, 2017 at 3:34 AM, Rafia Sabih
>>  wrote:
>>> [ regressions ]
>>
>> Thanks Rafia.  At first glance this plan is using the Parallel Shared
>> Hash in one place where it should pay off, that is loading the orders
>> table, but the numbers are terrible.  I noticed that it uses batch
>> files and then has to increase the number of batch files, generating a
>> bunch of extra work, even though it apparently overestimated the
>> number of rows, though that's only ~9 seconds of ~60.  I am
>> investigating.
>
> Hi Thomas,
> Apart from the previously reported regression, there appear one more
> issue in this set of patches. At times, running a query using parallel
> hash it hangs up and all the workers including the master shows the
> following backtrace,

Here's a new version to fix the problems reported by Rafia above.  The
patch descriptions are as before but it starts from 0002 because 0001
was committed as 7c5d8c16 (thanks, Andres).

First, some quick master-vs-patch numbers from the queries listed with
regressions, using TPCH dbgen scale 10, work_mem = 64MB,
max_parallel_workers_per_gather = 4, shared_buffers = 8GB (the numbers
themselves not comparable as different scale and different hardware).
Better except for Q5 and Q8, which for some mysterious reason plans
only one worker and then loses.  I'm looking into that.

Q3 19917.682 -> 8649.822
Q5 4149.983 -> 4192.551
Q7 14453.721 -> 10303.911
Q8 1981.540 -> 8030.264
Q9 26928.102 -> 17384.607
Q10 16955.240 -> 14563.787

I plan to explore the performance space with a range of worker numbers
and work_mem sizes and do some analysis; more soon.

Changes:

1.  Fixed two bugs that resulted in ExecHashShrink sometimes hanging,
as reported by Rafia.  (1) When splitting the large v3 patch up into
smaller patches for v4, I'd managed to lose the line that initialises
shared->shrink_barrier, causing some occasional strange behaviour.
(2) I found a bug[1] in condition_variable.c that could cause hangs
and fixed that via a separate patch and the fix was committed as
3f3d60d3 (thanks, Robert).

2.  Simplified barrier.c by removing BarrierWaitSet(), because that
turned out to be unnecessary to implement rescan as I'd originally
thought, and was incompatible with the way BarrierDetach() works.  The
latter assumes that the phase only ever increments, so that
combination of features was broken.

3.  Sorted out the hash table sizing logic that was previously leading
to some strange decisions about batches.  This involved putting the
total estimated number of inner rows into the path and plan when there
is a partial inner plan, because plan_rows only has the partial
number.  I need to size the hash table correctly at execution time.
It seems a bit strange to do that specifically and only for Hash (see
rows_total in the 0008 patch)... should there be some more generic
way?  Should total rows go into Plan rather than HashPlan, or perhaps
the parallel divisor should go somewhere?

4.  Comments fixed and added based on Ashutosh's feedback on patch 0003.

5.  Various small bug fixes.

I've also attached a small set of test queries that hit the four
"modes" (for want of a better word) of our hash join algorithm for
dealing with different memory conditions, which I've nicknamed thus:

1.  "Good":  We estimate that the hash table will fit in work_mem, and
at execution time it does.  This patch makes that more likely because
[Parallel] Shared Hash gets to use more work_mem as discussed.

2.  "Bad":  We estimate that the hash table won't fit in work_mem, but
that if we partition it into N batches using some bits from the hash
value then each batch will fit in work_mem.  At execution time, each
batch does indeed fit into work_mem.  This is not ideal, because we
have to write out and read back in N - (1 / N) inner and outer tuples
(ie all batches except the first one, although actually costsize.c
always charges for all of them).  But it may still be better than
other plans, and the IO is sequential.  Currently Shared Hash
shouldn't be selected over (private) Hash if it would require batching
anyway due to the cpu_shared_tuple_cost tie-breaker: on the one had it
avoids a bunch of copies of the batch files being written out, but on
the other it introduces a bunch of synchronisation overhead.  Parallel
Shared Hash is fairly likely to be chosen if possible be due to
division of the inner relation's cost outweighing
cpu_shared_tuple_cost.

3.  "Ugly":  We planned for "good" or "bad" mode, but we ran out of
work_mem at some point during execution: this could be during the
initial hash table load, or while loading a subsequent batch.  So now
we double the number of batches, splitting the current batch and all
batches that haven't been processed yet into two in the hope of
shrinking the 

Re: [HACKERS] WIP: [[Parallel] Shared] Hash

2017-02-08 Thread Ashutosh Bapat
>
> 0004-hj-refactor-batch-increases-v4.patch:
>
> Modify the existing hash join code to detect work_mem exhaustion at
> the point where chunks are allocated, instead of checking after every
> tuple insertion.  This matches the logic used for estimating, and more
> importantly allows for some parallelism in later patches.

The patch has three changes
1. change dense_alloc() to accept respect_workmem argument and use it
within the function.
2. Move call to ExecHashIncreaseNumBatches() into dense_alloc() from
ExecHashTableInsert() to account for memory before inserting new tuple
3. Check growEnabled before calling ExecHashIncreaseNumBatches().

I think checking growEnabled within ExecHashIncreaseNumBatches() is
more easy to maintain that checking at every caller. If someone is to
add a caller tomorrow, s/he has to remember to add the check.

It might be better to add some comments in
ExecHashRemoveNextSkewBucket() explaining why dense_alloc() should be
called with respect_work_mem = false? ExecHashSkewTableInsert() does
call ExecHashIncreaseNumBatches() after calling
ExecHashRemoveNextSkewBucket() multiple times, so it looks like we do
expect increase in space used and thus go beyond work_mem for a short
while. Is there a way we can handle this case in dense_alloc()?

Is it possible that increasing the number of batches changes the
bucket number of the tuple being inserted? If so, should we
recalculate the bucket and batch of the tuple being inserted?

-- 
Best Wishes,
Ashutosh Bapat
EnterpriseDB Corporation
The Postgres Database Company


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] WIP: [[Parallel] Shared] Hash

2017-02-01 Thread Thomas Munro
On Thu, Feb 2, 2017 at 4:57 PM, Rafia Sabih
 wrote:
> Apart from the previously reported regression, there appear one more
> issue in this set of patches. At times, running a query using parallel
> hash it hangs up and all the workers including the master shows the
> following backtrace,

Ugh, thanks.  Investigating.

-- 
Thomas Munro
http://www.enterprisedb.com


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] WIP: [[Parallel] Shared] Hash

2017-02-01 Thread Rafia Sabih
On Thu, Feb 2, 2017 at 1:19 AM, Thomas Munro
 wrote:
> On Thu, Feb 2, 2017 at 3:34 AM, Rafia Sabih
>  wrote:
>> 9 | 62928.88   | 59077.909
>
> Thanks Rafia.  At first glance this plan is using the Parallel Shared
> Hash in one place where it should pay off, that is loading the orders
> table, but the numbers are terrible.  I noticed that it uses batch
> files and then has to increase the number of batch files, generating a
> bunch of extra work, even though it apparently overestimated the
> number of rows, though that's only ~9 seconds of ~60.  I am
> investigating.

Hi Thomas,
Apart from the previously reported regression, there appear one more
issue in this set of patches. At times, running a query using parallel
hash it hangs up and all the workers including the master shows the
following backtrace,

#0  0x3fff880c7de8 in __epoll_wait_nocancel () from /lib64/power8/libc.so.6
#1  0x104e2718 in WaitEventSetWaitBlock (set=0x100157bde90,
cur_timeout=-1, occurred_events=0x3fffdbe69698, nevents=1) at
latch.c:998
#2  0x104e255c in WaitEventSetWait (set=0x100157bde90,
timeout=-1, occurred_events=0x3fffdbe69698, nevents=1,
wait_event_info=134217745) at latch.c:950
#3  0x10512970 in ConditionVariableSleep (cv=0x3ffd736e05a4,
wait_event_info=134217745) at condition_variable.c:132
#4  0x104dbb1c in BarrierWaitSet (barrier=0x3ffd736e0594,
new_phase=1, wait_event_info=134217745) at barrier.c:97
#5  0x104dbb9c in BarrierWait (barrier=0x3ffd736e0594,
wait_event_info=134217745) at barrier.c:127
#6  0x103296a8 in ExecHashShrink (hashtable=0x3ffd73747dc0) at
nodeHash.c:1075
#7  0x1032c46c in dense_alloc_shared
(hashtable=0x3ffd73747dc0, size=40, shared=0x3fffdbe69eb8,
respect_work_mem=1 '\001') at nodeHash.c:2618
#8  0x1032a2f0 in ExecHashTableInsert
(hashtable=0x3ffd73747dc0, slot=0x100158f9e90, hashvalue=2389907270)
at nodeHash.c:1476
#9  0x10327fd0 in MultiExecHash (node=0x100158f9800) at nodeHash.c:296
#10 0x10306730 in MultiExecProcNode (node=0x100158f9800) at
execProcnode.c:577

The issue is not deterministic and straightforwardly reproducible,
sometimes after make clean, etc. queries run sometimes they hang up
again. I wanted to bring this to your notice hoping you might be
faster than me in picking up the exact reason behind this anomaly.

-- 
Regards,
Rafia Sabih
EnterpriseDB: http://www.enterprisedb.com/


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] WIP: [[Parallel] Shared] Hash

2017-02-01 Thread Thomas Munro
On Thu, Feb 2, 2017 at 3:34 AM, Rafia Sabih
 wrote:
> 9 | 62928.88   | 59077.909

Thanks Rafia.  At first glance this plan is using the Parallel Shared
Hash in one place where it should pay off, that is loading the orders
table, but the numbers are terrible.  I noticed that it uses batch
files and then has to increase the number of batch files, generating a
bunch of extra work, even though it apparently overestimated the
number of rows, though that's only ~9 seconds of ~60.  I am
investigating.

-- 
Thomas Munro
http://www.enterprisedb.com


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] WIP: [[Parallel] Shared] Hash

2017-01-31 Thread Ashutosh Bapat
>
>> However, ExecHashIncreaseNumBatches() may change the
>> number of buckets; the patch does not seem to account for spaceUsed changes
>> because of that.
>
> That's what this hunk is intended to do:
>
> @@ -795,6 +808,12 @@ ExecHashIncreaseNumBuckets(HashJoinTable hashtable)
> TRACE_POSTGRESQL_HASH_INCREASE_BUCKETS(hashtable->nbuckets,
>
> hashtable->nbuckets_optimal);
>
> +   /* account for the increase in space that will be used by buckets */
> +   hashtable->spaceUsed += sizeof(HashJoinTuple) *
> +   (hashtable->nbuckets_optimal - hashtable->nbuckets);
> +   if (hashtable->spaceUsed > hashtable->spacePeak)
> +   hashtable->spacePeak = hashtable->spaceUsed;
> +

Sorry, I missed that hunk. You are right, it's getting accounted for.

>>
>> In ExecHashTableReset(), do we want to update spacePeak while setting
>> spaceUsed.
>
> I figured there was no way that the new spaceUsed value could be
> bigger than spacePeak, because we threw out all chunks and have just
> the bucket array, and we had that number of buckets before, so
> spacePeak must at least have been set to a number >= this number
> either when we expanded to this many buckets, or when we created the
> hashtable in the first place.  Perhaps I should
> Assert(hashtable->spaceUsed <= hashtable->spacePeak).

That would help, better if you explain this with a comment before Assert.

>
>> While this patch tracks space usage more accurately, I am afraid we might be
>> overdoing it; a reason why we don't track space usage accurately now. But I
>> think I will leave it to be judged by someone who is more familiar with the
>> code and possibly has historical perspective.
>
> Well it's not doing more work; it doesn't make any practical
> difference whatsoever but it's technically doing less work than
> master, by doing memory accounting only when acquiring a new 32KB
> chunk.

This patch does more work while counting the space used by buckets, I
guess. AFAIU, right now, that happens only after the hash table is
built completely. But that's fine. I am not worried about whether the
it's less work or more.

> But if by overdoing it you mean that no one really cares about
> the tiny increase in accuracy so the patch on its own is a bit of a
> waste of time, you're probably right.

This is what I meant by overdoing; you have spelled it better.

> Depending on tuple size, you
> could imagine something like 64 bytes of header and unused space per
> 32KB chunk that we're not accounting for, and that's only 0.2%.  So I
> probably wouldn't propose this refactoring just on accuracy grounds
> alone.
>
> This refactoring is intended to pave the way for shared memory
> accounting in the later patches, which would otherwise generate ugly
> IPC if done for every time a tuple is allocated.  I considered using
> atomic add to count space per tuple, or maintaining per-backend local
> subtotals and periodically summing.  Then I realised that switching to
> per-chunk accounting would fix the IPC problem AND be justifiable on
> theoretical grounds.  When we allocate a new 32KB chunk, we really are
> using 32KB more of your memory.

+1.

Thanks for considering the comments.

-- 
Best Wishes,
Ashutosh Bapat
EnterpriseDB Corporation
The Postgres Database Company


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] WIP: [[Parallel] Shared] Hash

2017-01-31 Thread Thomas Munro
On Wed, Feb 1, 2017 at 2:10 AM, Ashutosh Bapat
 wrote:
>>
>> 0003-hj-refactor-memory-accounting-v4.patch:
>> [...]
>>
> I looked at this patch. I agree that it accounts the memory usage more
> accurately. Here are few comments.

Thanks for the review!

> spaceUsed is defined with comment
> SizespaceUsed;/* memory space currently used by tuples */
>
> In ExecHashTableCreate(), although the space is allocated for buckets, no
> tuples are yet inserted, so no space is used by the tuples, so going strictly
> by the comment, spaceUsed should be 0 in that function. But I think the patch
> is accounting the spaceUsed more accurately. Without this patch, the actual
> allocation might cross spaceAllowed without being noticed. With this patch
> that's not possible. Probably we should change the comment to say memory space
> currently allocated.

Right, that comment is out of date.  It is now the space used by the
bucket array and the tuples.  I will fix that in the next version.

> However, ExecHashIncreaseNumBatches() may change the
> number of buckets; the patch does not seem to account for spaceUsed changes
> because of that.

That's what this hunk is intended to do:

@@ -795,6 +808,12 @@ ExecHashIncreaseNumBuckets(HashJoinTable hashtable)
TRACE_POSTGRESQL_HASH_INCREASE_BUCKETS(hashtable->nbuckets,

hashtable->nbuckets_optimal);

+   /* account for the increase in space that will be used by buckets */
+   hashtable->spaceUsed += sizeof(HashJoinTuple) *
+   (hashtable->nbuckets_optimal - hashtable->nbuckets);
+   if (hashtable->spaceUsed > hashtable->spacePeak)
+   hashtable->spacePeak = hashtable->spaceUsed;
+
hashtable->nbuckets = hashtable->nbuckets_optimal;
hashtable->log2_nbuckets = hashtable->log2_nbuckets_optimal;

It knows that spaceUsed already includes the old bucket array
(nbuckets), so it figures out how much bigger the new bucket array
will be (nbuckets_optimal - nbuckets) and adds that.

> Without this patch ExecHashTableInsert() used to account for the space used by
> a single tuple inserted. The patch moves this calculation in dense_alloc() and
> accounts for out-of-bound allocation for larger tuples. That's good.
>
> The change in ExecChooseHashTableSize() too looks fine.
>
> In ExecHashTableReset(), do we want to update spacePeak while setting
> spaceUsed.

I figured there was no way that the new spaceUsed value could be
bigger than spacePeak, because we threw out all chunks and have just
the bucket array, and we had that number of buckets before, so
spacePeak must at least have been set to a number >= this number
either when we expanded to this many buckets, or when we created the
hashtable in the first place.  Perhaps I should
Assert(hashtable->spaceUsed <= hashtable->spacePeak).

> While this patch tracks space usage more accurately, I am afraid we might be
> overdoing it; a reason why we don't track space usage accurately now. But I
> think I will leave it to be judged by someone who is more familiar with the
> code and possibly has historical perspective.

Well it's not doing more work; it doesn't make any practical
difference whatsoever but it's technically doing less work than
master, by doing memory accounting only when acquiring a new 32KB
chunk.  But if by overdoing it you mean that no one really cares about
the tiny increase in accuracy so the patch on its own is a bit of a
waste of time, you're probably right.  Depending on tuple size, you
could imagine something like 64 bytes of header and unused space per
32KB chunk that we're not accounting for, and that's only 0.2%.  So I
probably wouldn't propose this refactoring just on accuracy grounds
alone.

This refactoring is intended to pave the way for shared memory
accounting in the later patches, which would otherwise generate ugly
IPC if done for every time a tuple is allocated.  I considered using
atomic add to count space per tuple, or maintaining per-backend local
subtotals and periodically summing.  Then I realised that switching to
per-chunk accounting would fix the IPC problem AND be justifiable on
theoretical grounds.  When we allocate a new 32KB chunk, we really are
using 32KB more of your memory.

-- 
Thomas Munro
http://www.enterprisedb.com


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] WIP: [[Parallel] Shared] Hash

2017-01-31 Thread Ashutosh Bapat
>
> 0003-hj-refactor-memory-accounting-v4.patch:
>
> Modify the existing hash join code to work in terms of chunks when
> estimating and later tracking memory usage.  This is probably more
> accurate than the current tuple-based approach, because it tries to
> take into account the space used by chunk headers and the wasted space
> in chunks.  In practice the difference is probably small, but it's
> arguably more accurate;  I did this because I need chunk-based
> accounting the later patches.  Also, make HASH_CHUNK_SIZE the actual
> size of allocated chunks (ie the header information is included in
> that size so we allocate exactly 32KB, not 32KB + a bit, for the
> benefit of the dsa allocator which otherwise finishes up allocating
> 36KB).
>
I looked at this patch. I agree that it accounts the memory usage more
accurately. Here are few comments.

spaceUsed is defined with comment
SizespaceUsed;/* memory space currently used by tuples */

In ExecHashTableCreate(), although the space is allocated for buckets, no
tuples are yet inserted, so no space is used by the tuples, so going strictly
by the comment, spaceUsed should be 0 in that function. But I think the patch
is accounting the spaceUsed more accurately. Without this patch, the actual
allocation might cross spaceAllowed without being noticed. With this patch
that's not possible. Probably we should change the comment to say memory space
currently allocated. However, ExecHashIncreaseNumBatches() may change the
number of buckets; the patch does not seem to account for spaceUsed changes
because of that.

Without this patch ExecHashTableInsert() used to account for the space used by
a single tuple inserted. The patch moves this calculation in dense_alloc() and
accounts for out-of-bound allocation for larger tuples. That's good.

The change in ExecChooseHashTableSize() too looks fine.

In ExecHashTableReset(), do we want to update spacePeak while setting
spaceUsed.

While this patch tracks space usage more accurately, I am afraid we might be
overdoing it; a reason why we don't track space usage accurately now. But I
think I will leave it to be judged by someone who is more familiar with the
code and possibly has historical perspective.

-- 
Best Wishes,
Ashutosh Bapat
EnterpriseDB Corporation
The Postgres Database Company


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] WIP: [[Parallel] Shared] Hash

2017-01-30 Thread Michael Paquier
On Sat, Jan 28, 2017 at 10:03 AM, Thomas Munro
 wrote:
> I have broken this up into a patch series, harmonised the private vs
> shared hash table code paths better and fixed many things including
> the problems with rescans and regression tests mentioned upthread.
> You'll see that one of the patches is that throwaway BufFile
> import/export facility, which I'll replace with your code as
> discussed.

Patch moved to CF 2017-03.
-- 
Michael


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] WIP: [[Parallel] Shared] Hash

2017-01-29 Thread Thomas Munro
On Sat, Jan 7, 2017 at 9:01 AM, Thomas Munro
 wrote:
> Stepping back a bit, I am aware of the following approaches to hash
> join parallelism:
>
> 1.  Run the inner plan and build a private hash table in each
> participant [...].
>
> 2.  Run a partition-wise hash join[1].  [...]
>
> 3.  Repartition the data on the fly, and then run a partition-wise
> hash join.  [...]
>
> 4.  Scatter both the inner and outer plans arbitrarily across
> participants [...], and build a shared hash
> table.  [...]
>
> [...] I suspect that 4 is probably a better
> fit than 3 for Postgres today, because the communication overhead of
> shovelling nearly all tuples through extra tuple queues to route them
> to the right hash table would surely be very high, though I can see
> that it's very attractive to have a reusable tuple repartitioning
> operator and then run k disjoint communication-free joins (again,
> without code change to the join operator, and to the benefit of all
> join operators).

On this topic, I recently stumbled on the 2011 paper "Design and
Evaluation of Main Memory Hash Join Algorithms for Multi-core CPUs"[1]
and found it reassuring.  It compares simple shared hash tables to
some state-of-the-art repartitioning approaches (including the radix
join algorithm which performs the amazing feat of building a lot of
cacheline-sized hash tables and then runs with minimal cache misses).

From the introduction:

"Second, we show that an algorithm that does not do any partitioning,
but simply constructs a single shared hash table on the build relation
often outperforms more complex algorithms. This simple
“no-partitioning” hash join algorithm is robust to sub-optimal
parameter choices by the optimizer, and does not require any knowledge
of the characteristics of the input to work well. To the best of our
knowledge, this simple hash join technique differs from what is
currently implemented in existing DBMSs for multi-core hash join
processing, and offers a tantalizingly simple, efficient, and robust
technique for implementing the hash join operation."

"Finally, we show that the simple “no-partitioning” hash join
algorithm takes advantage of intrinsic hardware optimizations to
handle skew. As a result, this simple hash join technique often
benefits from skew and its relative performance increases as the skew
increases! This property is a big advancement over the
state-of-the-art methods, as it is important to have methods that can
gracefully handle skew in practice [8]."

(That relates to SMT pipelining compensating for the extra cacheline
misses during probing by doing thread A's work while waiting for
thread B's memory to be fetched.)

From the conclusion:

"... Our results show that a simple hash join technique that does not
do any partitioning of the input relations often outperforms the other
more complex partitioning-based join alternatives. In addition, the
relative performance of this simple hash join technique rapidly
improves with increasing skew, and it outperforms every other
algorithm in the presence of even small amounts of skew."

For balance, the authors of a 2013 paper "Main-Memory Hash Joins on
Multi-Core CPUs: Tuning to the Underlying Hardware"[2] are less keen
on the simple "hardware-oblivious" "no partitioning" approach and
don't buy the other paper's ideas about SMT.   Incidentally, their
results on the benefits of large (huge) pages are interesting (table
VI) and suggest that huge page support for DSM segments could be good
here.

[1] 
https://pdfs.semanticscholar.org/9de4/b32f2c7b630a4f6aae6994a362a46c7c49e9.pdf
[2] 
https://www.inf.ethz.ch/personal/cagri.balkesen/publications/parallel-joins-icde13.pdf

-- 
Thomas Munro
http://www.enterprisedb.com


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] WIP: [[Parallel] Shared] Hash

2017-01-27 Thread Peter Geoghegan
Hi Thomas,

On Fri, Jan 27, 2017 at 5:03 PM, Thomas Munro
 wrote:
> I have broken this up into a patch series, harmonised the private vs
> shared hash table code paths better and fixed many things including
> the problems with rescans and regression tests mentioned upthread.
> You'll see that one of the patches is that throwaway BufFile
> import/export facility, which I'll replace with your code as
> discussed.

I'll try to get back to this ASAP, but expect to be somewhat busy next
week. Next week will be my last week at Heroku.

It was not an easy decision for me to leave Heroku, but I felt it was
time for a change. I am very grateful to have had the opportunity. I
have learned an awful lot during my time at the company. It has been
excellent to have an employer that has been so supportive of my work
on Postgres this whole time.

-- 
Peter Geoghegan


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] WIP: [[Parallel] Shared] Hash

2017-01-27 Thread Thomas Munro
On Fri, Jan 13, 2017 at 2:36 PM, Peter Geoghegan  wrote:
> [...]
> Yeah. That's basically what the BufFile unification process can
> provide you with (or will, once I get around to implementing the
> refcount thing, which shouldn't be too hard). As already noted, I'll
> also want to make it defer creation of a leader-owned segment, unless
> and until that proves necessary, which it never will for hash join.

Hi Peter,

I have broken this up into a patch series, harmonised the private vs
shared hash table code paths better and fixed many things including
the problems with rescans and regression tests mentioned upthread.
You'll see that one of the patches is that throwaway BufFile
import/export facility, which I'll replace with your code as
discussed.

The three 'refactor' patches change the existing hash join code to
work in terms of chunks in more places.  These may be improvements in
their own right, but mainly they pave the way for parallelism.  The
later patches introduce single-batch and then multi-batch shared
tables.

The patches in the attached tarball are:

0001-nail-down-regression-test-row-order-v4.patch:

A couple of regression tests would fail with later refactoring that
changes the order of unmatched rows emitted by hash joins.  So first,
let's fix that by adding ORDER BY in those places, without any code
changes.

0002-hj-add-dtrace-probes-v4.patch:

Before making any code changes, let's add some dtrace probes so that
we can measure time spent doing different phases of hash join work
before and after the later changes.  The main problem with the probes
as I have them here (and the extra probes inserted by later patches in
the series) is that interesting query plans contain multiple hash
joins so these get all mixed up when you're trying to measure stuff,
so perhaps I should pass executor node IDs into all the probes.  More
on this later.  (If people don't want dtrace probes in the executor,
I'm happy to omit them and maintain that kind of thing locally for my
own testing purposes...)

0003-hj-refactor-memory-accounting-v4.patch:

Modify the existing hash join code to work in terms of chunks when
estimating and later tracking memory usage.  This is probably more
accurate than the current tuple-based approach, because it tries to
take into account the space used by chunk headers and the wasted space
in chunks.  In practice the difference is probably small, but it's
arguably more accurate;  I did this because I need chunk-based
accounting the later patches.  Also, make HASH_CHUNK_SIZE the actual
size of allocated chunks (ie the header information is included in
that size so we allocate exactly 32KB, not 32KB + a bit, for the
benefit of the dsa allocator which otherwise finishes up allocating
36KB).

0004-hj-refactor-batch-increases-v4.patch:

Modify the existing hash join code to detect work_mem exhaustion at
the point where chunks are allocated, instead of checking after every
tuple insertion.  This matches the logic used for estimating, and more
importantly allows for some parallelism in later patches.

0005-hj-refactor-unmatched-v4.patch:

Modifies the existing hash join code to handle unmatched tuples in
right/full joins chunk-by-chunk.  This is probably a cache-friendlier
scan order anyway, but the real goal is to provide a natural grain for
parallelism in a later patch.

0006-hj-barrier-v4.patch:

The patch from a nearby thread previously presented as a dependency of
this project.  It might as well be considered part of this patch
series.

0007-hj-exec-detach-node-v4.patch

By the time ExecEndNode() runs in workers, ExecShutdownNode() has
already run.  That's done on purpose because, for example, the hash
table needs to survive longer than the parallel environment to allow
EXPLAIN to peek at it.  But it means that the Gather node has thrown
out the shared memory before any parallel-aware node below it gets to
run its Shutdown and End methods.  So I invented ExecDetachNode()
which runs before ExecShutdownNode(), giving parallel-aware nodes a
chance to say goodbye before their shared memory vanishes.  Better
ideas?

0008-hj-shared-single-batch-v4.patch:

Introduces hash joins with "Shared Hash" and "Parallel Shared Hash"
nodes, for single-batch joins only.  If the planner has a partial
inner plan, it'll pick a Parallel Shared Hash plan to divide that over
K participants.  Failing that, if the planner has a parallel-safe
inner plan and thinks that it can avoid batching by using work_mem * K
memory (shared by all K participants), it will now use a Shared Hash.
Otherwise it'll typically use a Hash plan as before.  Without the
later patches, it will blow through work_mem * K if it turns out to
have underestimated the hash table size, because it lacks
infrastructure for dealing with batches.

The trickiest thing at this point in the series is that participants
(workers and the leader) can show up at any time, so there are three
places that provide synchronisation with a parallel 

Re: [HACKERS] WIP: [[Parallel] Shared] Hash

2017-01-12 Thread Peter Geoghegan
On Wed, Jan 11, 2017 at 7:37 PM, Thomas Munro
 wrote:
> Hmm.  Yes, that is an entirely bogus use of isInterXact.  I am
> thinking about how to fix that with refcounts.

Cool. As I said, the way I'd introduce refcounts would not be very
different from what I've already done -- there'd still be a strong
adherence to the use of resource managers to clean-up, with that
including exactly one particular backend doing the extra step of
deletion. The refcount only changes which backend does that extra step
in corner cases, which is conceptually a very minor change.

>> I don't think it's right for buffile.c to know anything about file
>> paths directly -- I'd say that that's a modularity violation.
>> PathNameOpenFile() is called by very few callers at the moment, all of
>> them very low level (e.g. md.c), but you're using it within buffile.c
>> to open a path to the file that you obtain from shared memory
>
> Hmm.  I'm not seeing the modularity violation.  buffile.c uses
> interfaces already exposed by fd.c to do this:  OpenTemporaryFile,
> then FilePathName to find the path, then PathNameOpenFile to open from
> another process.  I see that your approach instead has client code
> provide more meta data so that things can be discovered, which may
> well be a much better idea.

Indeed, my point was that the metadata thing would IMV be better.
buffile.c shouldn't need to know about file paths, etc. Instead,
caller should pass BufFileImport()/BufFileUnify() simple private state
sufficient for routine to discover all details itself, based on a
deterministic scheme. In my tuplesort patch, that piece of state is:

 /*
+ * BufFileOp is an identifier for a particular parallel operation involving
+ * temporary files.  Parallel temp file operations must be discoverable across
+ * processes based on these details.
+ *
+ * These fields should be set by BufFileGetIdent() within leader process.
+ * Identifier BufFileOp makes temp files from workers discoverable within
+ * leader.
+ */
+typedef struct BufFileOp
+{
+   /*
+* leaderPid is leader process PID.
+*
+* tempFileIdent is an identifier for a particular temp file (or parallel
+* temp file op) for the leader.  Needed to distinguish multiple parallel
+* temp file operations within a given leader process.
+*/
+   int leaderPid;
+   longtempFileIdent;
+} BufFileOp;
+

> Right, that is a problem.  A refcount mode could fix that; virtual
> file descriptors would be closed in every backend using the current
> resource owner, and the files would be deleted when the last one turns
> out the lights.

Yeah. That's basically what the BufFile unification process can
provide you with (or will, once I get around to implementing the
refcount thing, which shouldn't be too hard). As already noted, I'll
also want to make it defer creation of a leader-owned segment, unless
and until that proves necessary, which it never will for hash join.

Perhaps I should make superficial changes to unification in my patch
to suit your work, like rename the field BufFileOp.leaderPid to
BufFileOp.ownerPid, without actually changing any behaviors, except as
noted in the last paragraph. Since you only require that backends be
able to open up some other backend's temp file themselves for a short
while, that gives you everything you need. You'll be doing unification
in backends, and not just within the leader as in the tuplesort patch,
I believe, but that's just fine. All that matters is that you present
all data at once to a consuming backend via unification (since you
treat temp file contents as immutable, this will be true for hash
join, just as it is for tuplesort).

There is a good argument against my making such a tweak, however,
which is that maybe it's clearer to DBAs what's going on if temp file
names have the leader PID in them for all operations. So, maybe
BufFileOp.leaderPid isn't renamed to BufFileOp.ownerPid by me;
instead, you always make it the leader pid, while at the same time
having the leader dole out BufFileOp.tempFileIdent identifiers to each
worker as needed (see how I generate BufFileOps for an idea of what I
mean if it's not immediately clear). That's also an easy change, or at
least will be once the refcount thing is added.

-- 
Peter Geoghegan


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] WIP: [[Parallel] Shared] Hash

2017-01-12 Thread Rafia Sabih
On Thu, Jan 12, 2017 at 9:07 AM, Thomas Munro  wrote:

> On Wed, Jan 11, 2017 at 2:56 PM, Peter Geoghegan  wrote:
> > On Fri, Jan 6, 2017 at 12:01 PM, Thomas Munro
> >  wrote:
> >> Here is a new WIP patch.  I have plenty of things to tidy up (see note
> >> at end), but the main ideas are now pretty clear and I'd appreciate
> >> some feedback.
> >
> > I have some review feedback for your V3. I've chosen to start with the
> > buffile.c stuff, since of course it might share something with my
> > parallel tuplesort patch. This isn't comprehensive, but I will have
> > more comprehensive feedback soon.
>
> Thanks!
>
> > I'm not surprised that you've generally chosen to make shared BufFile
> > management as simple as possible, with no special infrastructure other
> > than the ability to hold open other backend temp files concurrently
> > within a worker, and no writing to another worker's temp file, or
> > shared read pointer. As you put it, everything is immutable. I
> > couldn't see much opportunity for adding a lot of infrastructure that
> > wasn't written explicitly as parallel hash join code/infrastructure.
> > My sense is that that was a good decision. I doubted that you'd ever
> > want some advanced, generic shared BufFile thing with multiple read
> > pointers, built-in cache coherency, etc. (Robert seemed to think that
> > you'd go that way, though.)
>
> Right, this is extremely minimalist infrastructure.  fd.c is
> unchanged.  buffile.c only gains the power to export/import read-only
> views of BufFiles.  There is no 'unification' of BufFiles: each hash
> join participant simply reads from the buffile it wrote, and then
> imports and reads from its peers' BufFiles, until all are exhausted;
> so the 'unification' is happening in caller code which knows about the
> set of participants and manages shared read positions.  Clearly there
> are some ownership/cleanup issues to straighten out, but I think those
> problems are fixable (probably involving refcounts).
>
> I'm entirely willing to throw that away and use the unified BufFile
> concept, if it can be extended to support multiple readers of the
> data, where every participant unifies the set of files.  I have so far
> assumed that it would be most efficient for each participant to read
> from the file that it wrote before trying to read from files written
> by other participants.  I'm reading your patch now; more soon.
>
> > Anyway, some more specific observations:
> >
> > * ISTM that this is the wrong thing for shared BufFiles:
> >
> >> +BufFile *
> >> +BufFileImport(BufFileDescriptor *descriptor)
> >> +{
> > ...
> >> +   file->isInterXact = true; /* prevent cleanup by this backend */
> >
> > There is only one user of isInterXact = true BufFiles at present,
> > tuplestore.c. It, in turn, only does so for cases that require
> > persistent tuple stores. A quick audit of these tuplestore.c callers
> > show this to just be cursor support code within portalmem.c. Here is
> > the relevant tuplestore_begin_heap() rule that that code adheres to,
> > unlike the code I've quoted above:
> >
> >  * interXact: if true, the files used for on-disk storage persist beyond
> the
> >  * end of the current transaction.  NOTE: It's the caller's
> responsibility to
> >  * create such a tuplestore in a memory context and resource owner that
> will
> >  * also survive transaction boundaries, and to ensure the tuplestore is
> closed
> >  * when it's no longer wanted.
>
> Hmm.  Yes, that is an entirely bogus use of isInterXact.  I am
> thinking about how to fix that with refcounts.
>
> > I don't think it's right for buffile.c to know anything about file
> > paths directly -- I'd say that that's a modularity violation.
> > PathNameOpenFile() is called by very few callers at the moment, all of
> > them very low level (e.g. md.c), but you're using it within buffile.c
> > to open a path to the file that you obtain from shared memory
>
> Hmm.  I'm not seeing the modularity violation.  buffile.c uses
> interfaces already exposed by fd.c to do this:  OpenTemporaryFile,
> then FilePathName to find the path, then PathNameOpenFile to open from
> another process.  I see that your approach instead has client code
> provide more meta data so that things can be discovered, which may
> well be a much better idea.
>
> > directly. This is buggy because the following code won't be reached in
> > workers that call your BufFileImport() function:
> >
> > /* Mark it for deletion at close */
> > VfdCache[file].fdstate |= FD_TEMPORARY;
> >
> > /* Register it with the current resource owner */
> > if (!interXact)
> > {
> > VfdCache[file].fdstate |= FD_XACT_TEMPORARY;
> >
> > ResourceOwnerEnlargeFiles(CurrentResourceOwner);
> > ResourceOwnerRememberFile(CurrentResourceOwner, file);
> > VfdCache[file].resowner = CurrentResourceOwner;
> >
> > /* ensure cleanup happens 

Re: [HACKERS] WIP: [[Parallel] Shared] Hash

2017-01-11 Thread Thomas Munro
On Wed, Jan 11, 2017 at 2:56 PM, Peter Geoghegan  wrote:
> On Fri, Jan 6, 2017 at 12:01 PM, Thomas Munro
>  wrote:
>> Here is a new WIP patch.  I have plenty of things to tidy up (see note
>> at end), but the main ideas are now pretty clear and I'd appreciate
>> some feedback.
>
> I have some review feedback for your V3. I've chosen to start with the
> buffile.c stuff, since of course it might share something with my
> parallel tuplesort patch. This isn't comprehensive, but I will have
> more comprehensive feedback soon.

Thanks!

> I'm not surprised that you've generally chosen to make shared BufFile
> management as simple as possible, with no special infrastructure other
> than the ability to hold open other backend temp files concurrently
> within a worker, and no writing to another worker's temp file, or
> shared read pointer. As you put it, everything is immutable. I
> couldn't see much opportunity for adding a lot of infrastructure that
> wasn't written explicitly as parallel hash join code/infrastructure.
> My sense is that that was a good decision. I doubted that you'd ever
> want some advanced, generic shared BufFile thing with multiple read
> pointers, built-in cache coherency, etc. (Robert seemed to think that
> you'd go that way, though.)

Right, this is extremely minimalist infrastructure.  fd.c is
unchanged.  buffile.c only gains the power to export/import read-only
views of BufFiles.  There is no 'unification' of BufFiles: each hash
join participant simply reads from the buffile it wrote, and then
imports and reads from its peers' BufFiles, until all are exhausted;
so the 'unification' is happening in caller code which knows about the
set of participants and manages shared read positions.  Clearly there
are some ownership/cleanup issues to straighten out, but I think those
problems are fixable (probably involving refcounts).

I'm entirely willing to throw that away and use the unified BufFile
concept, if it can be extended to support multiple readers of the
data, where every participant unifies the set of files.  I have so far
assumed that it would be most efficient for each participant to read
from the file that it wrote before trying to read from files written
by other participants.  I'm reading your patch now; more soon.

> Anyway, some more specific observations:
>
> * ISTM that this is the wrong thing for shared BufFiles:
>
>> +BufFile *
>> +BufFileImport(BufFileDescriptor *descriptor)
>> +{
> ...
>> +   file->isInterXact = true; /* prevent cleanup by this backend */
>
> There is only one user of isInterXact = true BufFiles at present,
> tuplestore.c. It, in turn, only does so for cases that require
> persistent tuple stores. A quick audit of these tuplestore.c callers
> show this to just be cursor support code within portalmem.c. Here is
> the relevant tuplestore_begin_heap() rule that that code adheres to,
> unlike the code I've quoted above:
>
>  * interXact: if true, the files used for on-disk storage persist beyond the
>  * end of the current transaction.  NOTE: It's the caller's responsibility to
>  * create such a tuplestore in a memory context and resource owner that will
>  * also survive transaction boundaries, and to ensure the tuplestore is closed
>  * when it's no longer wanted.

Hmm.  Yes, that is an entirely bogus use of isInterXact.  I am
thinking about how to fix that with refcounts.

> I don't think it's right for buffile.c to know anything about file
> paths directly -- I'd say that that's a modularity violation.
> PathNameOpenFile() is called by very few callers at the moment, all of
> them very low level (e.g. md.c), but you're using it within buffile.c
> to open a path to the file that you obtain from shared memory

Hmm.  I'm not seeing the modularity violation.  buffile.c uses
interfaces already exposed by fd.c to do this:  OpenTemporaryFile,
then FilePathName to find the path, then PathNameOpenFile to open from
another process.  I see that your approach instead has client code
provide more meta data so that things can be discovered, which may
well be a much better idea.

> directly. This is buggy because the following code won't be reached in
> workers that call your BufFileImport() function:
>
> /* Mark it for deletion at close */
> VfdCache[file].fdstate |= FD_TEMPORARY;
>
> /* Register it with the current resource owner */
> if (!interXact)
> {
> VfdCache[file].fdstate |= FD_XACT_TEMPORARY;
>
> ResourceOwnerEnlargeFiles(CurrentResourceOwner);
> ResourceOwnerRememberFile(CurrentResourceOwner, file);
> VfdCache[file].resowner = CurrentResourceOwner;
>
> /* ensure cleanup happens at eoxact */
> have_xact_temporary_files = true;
> }

Right, that is a problem.  A refcount mode could fix that; virtual
file descriptors would be closed in every backend using the current
resource owner, and the files would be deleted when the last one turns
out the 

Re: [HACKERS] WIP: [[Parallel] Shared] Hash

2017-01-11 Thread Peter Geoghegan
On Wed, Jan 11, 2017 at 12:05 PM, Robert Haas  wrote:
> On Wed, Jan 11, 2017 at 2:20 PM, Peter Geoghegan  wrote:
>> You'd probably still want to throw an error when workers ended up not
>> deleting BufFile segments they owned, though, at least for parallel
>> tuplesort.
>
> Don't see why.

Simply because that's not expected as things stand -- why should the
file go away in that context? (Admittedly, that doesn't seem like an
excellent reason now.)

I actually like the idea of a reference count, the more I think about
it, since it doesn't actually have any tension with my original idea
of ownership. If something like a randomAccess parallel tuplesort
leader merge needs to write new segments (which it almost certainly
*won't* anyway, due to my recent V7 changes), then it can still own
those new segments itself, alone, and delete them on its own in the
manner of conventional temp files, because we can still restrict the
shared refcount mechanism to the deletion of "initial" segments. The
refcount == 0 deleter only deletes those initial segments, and not any
same-BufFile segments that might have been added (added to append to
our unified BufFile within leader). I think that parallel hash join
won't use this at all, and, as I said, it's only a theoretical
requirement for parallel tuplesort, which will generally recycle
blocks from worker temp files for its own writes all the time for
randomAccess cases, the only cases that ever write within logtape.c.

So, the only BufFile shared state needed, that must be maintained over
time, is the refcount variable itself. The size of the "initial"
BufFile (from which we derive number of new segments during
unification) is passed, but it doesn't get maintained in shared
memory. BufFile size remains a one way, one time message needed during
unification. I only really need to tweak things in fd.c temp routines
to make all this work.

This is something I like because it makes certain theoretically useful
things easier. Say, for example, we wanted to have tuplesort workers
merge worker final materialized tapes (their final output), in order
to arrange for the leader to have fewer than $NWORKER runs to merge at
the end -- that's made easier by the refcount stuff. (I'm still not
convinced that that's actually going to make CREATE INDEX faster.
Still, it should, on general principle, be easy to write a patch that
makes it happen -- a good overall design should leave things so that
writing that prototype patch is easy.)

>> This idea is something that's much more limited than the
>> SharedTemporaryFile() API that you sketched on the parallel sort
>> thread, because it only concerns resource management, and not how to
>> make access to the shared file concurrency safe in any special,
>> standard way.
>
> Actually, I only intended that sketch to be about resource management.
> Sounds like I didn't explain very well.

I'm glad to hear that, because I was very puzzled by what you said. I
guess I was thrown off by "shared read pointers". I don't want to get
into the business of flushing out dirty buffers, or making sure that
every backend stays in lockstep about what the total size of the
BufFile needs to be. It's so much simpler to just have clear
"barriers" for each parallel operation, where backends present a large
amount of immutable state to one other backend at the end, and tells
it how big its BufFile is only once. (It's not quite immutable, since
randomAccess recycle of temp files can happen within logtape.c, but
the point is that there should be very little back and forth -- that
needs to be severely restricted.)

-- 
Peter Geoghegan


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] WIP: [[Parallel] Shared] Hash

2017-01-11 Thread Peter Geoghegan
On Wed, Jan 11, 2017 at 11:20 AM, Peter Geoghegan  wrote:
>> If multiple processes are using the same file via the BufFile
>> interface, I think that it is absolutely necessary that there should
>> be a provision to track the "attach count" of the BufFile.  Each
>> process that reaches EOXact decrements the attach count and when it
>> reaches 0, the process that reduced it to 0 removes the BufFile.  I
>> think anything that's based on the notion that leaders will remove
>> files and workers won't is going to be fragile and limiting, and I am
>> going to push hard against any such proposal.
>
> Okay. My BufFile unification approach happens to assume that backends
> clean up after themselves, but that isn't a ridged assumption (of
> course, these are always temp files, so we reason about them as temp
> files).

Also, to be clear, and to avoid confusion: I don't think anyone wants
an approach "that's based on the notion that leaders will remove files
and workers won't". All that has been suggested is that the backend
that creates the file should be responsible for deleting the file, by
definition. And, that any other backend that may have files owned by
another backend must be sure to not try to access them after the owner
deletes them. (Typically, that would be ensured by some barrier
condition, some dependency, inherent to how the parallel operation is
implemented.)

I will implement the reference count thing.
-- 
Peter Geoghegan


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] WIP: [[Parallel] Shared] Hash

2017-01-11 Thread Robert Haas
On Wed, Jan 11, 2017 at 2:20 PM, Peter Geoghegan  wrote:
> You'd probably still want to throw an error when workers ended up not
> deleting BufFile segments they owned, though, at least for parallel
> tuplesort.

Don't see why.

> This idea is something that's much more limited than the
> SharedTemporaryFile() API that you sketched on the parallel sort
> thread, because it only concerns resource management, and not how to
> make access to the shared file concurrency safe in any special,
> standard way.

Actually, I only intended that sketch to be about resource management.
Sounds like I didn't explain very well.

> Instead, they should be passing around some kind of minimal
> private-to-buffile state in shared memory that coordinates backends
> participating in BufFile unification. Private state created by
> buffile.c, and passed back to buffile.c. Everything should be
> encapsulated within buffile.c, IMV, making parallel implementations as
> close as possible to their serial implementations.

That seems reasonable although I haven't studied the details carefully as yet.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] WIP: [[Parallel] Shared] Hash

2017-01-11 Thread Peter Geoghegan
On Wed, Jan 11, 2017 at 10:57 AM, Robert Haas  wrote:
> On Tue, Jan 10, 2017 at 8:56 PM, Peter Geoghegan  wrote:
>> Instead of all this, I suggest copying some of my changes to fd.c, so
>> that resource ownership within fd.c differentiates between a vfd that
>> is owned by the backend in the conventional sense, including having a
>> need to delete at eoxact, as well as a lesser form of ownership where
>> deletion should not happen.
>
> If multiple processes are using the same file via the BufFile
> interface, I think that it is absolutely necessary that there should
> be a provision to track the "attach count" of the BufFile.  Each
> process that reaches EOXact decrements the attach count and when it
> reaches 0, the process that reduced it to 0 removes the BufFile.  I
> think anything that's based on the notion that leaders will remove
> files and workers won't is going to be fragile and limiting, and I am
> going to push hard against any such proposal.

Okay. My BufFile unification approach happens to assume that backends
clean up after themselves, but that isn't a ridged assumption (of
course, these are always temp files, so we reason about them as temp
files). It could be based on a refcount fairly easily, such that, as
you say here, deletion of files occurs within workers (that "own" the
files) only as a consequence of their being the last backend with a
reference, that must therefore "turn out the lights" (delete the
file). That seems consistent with what I've done within fd.c, and what
I suggested to Thomas (that he more or less follow that approach).
You'd probably still want to throw an error when workers ended up not
deleting BufFile segments they owned, though, at least for parallel
tuplesort.

This idea is something that's much more limited than the
SharedTemporaryFile() API that you sketched on the parallel sort
thread, because it only concerns resource management, and not how to
make access to the shared file concurrency safe in any special,
standard way. I think that this resource management is something that
should be managed by buffile.c (and the temp file routines within fd.c
that are morally owned by buffile.c, their only caller). It shouldn't
be necessary for a client of this new infrastructure, such as parallel
tuplesort or parallel hash join, to know anything about file paths.
Instead, they should be passing around some kind of minimal
private-to-buffile state in shared memory that coordinates backends
participating in BufFile unification. Private state created by
buffile.c, and passed back to buffile.c. Everything should be
encapsulated within buffile.c, IMV, making parallel implementations as
close as possible to their serial implementations.

-- 
Peter Geoghegan


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] WIP: [[Parallel] Shared] Hash

2017-01-11 Thread Robert Haas
On Tue, Jan 10, 2017 at 8:56 PM, Peter Geoghegan  wrote:
> Instead of all this, I suggest copying some of my changes to fd.c, so
> that resource ownership within fd.c differentiates between a vfd that
> is owned by the backend in the conventional sense, including having a
> need to delete at eoxact, as well as a lesser form of ownership where
> deletion should not happen.

If multiple processes are using the same file via the BufFile
interface, I think that it is absolutely necessary that there should
be a provision to track the "attach count" of the BufFile.  Each
process that reaches EOXact decrements the attach count and when it
reaches 0, the process that reduced it to 0 removes the BufFile.  I
think anything that's based on the notion that leaders will remove
files and workers won't is going to be fragile and limiting, and I am
going to push hard against any such proposal.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] WIP: [[Parallel] Shared] Hash

2017-01-10 Thread Peter Geoghegan
On Fri, Jan 6, 2017 at 12:01 PM, Thomas Munro
 wrote:
> Here is a new WIP patch.  I have plenty of things to tidy up (see note
> at end), but the main ideas are now pretty clear and I'd appreciate
> some feedback.

I have some review feedback for your V3. I've chosen to start with the
buffile.c stuff, since of course it might share something with my
parallel tuplesort patch. This isn't comprehensive, but I will have
more comprehensive feedback soon.

I'm not surprised that you've generally chosen to make shared BufFile
management as simple as possible, with no special infrastructure other
than the ability to hold open other backend temp files concurrently
within a worker, and no writing to another worker's temp file, or
shared read pointer. As you put it, everything is immutable. I
couldn't see much opportunity for adding a lot of infrastructure that
wasn't written explicitly as parallel hash join code/infrastructure.
My sense is that that was a good decision. I doubted that you'd ever
want some advanced, generic shared BufFile thing with multiple read
pointers, built-in cache coherency, etc. (Robert seemed to think that
you'd go that way, though.)

Anyway, some more specific observations:

* ISTM that this is the wrong thing for shared BufFiles:

> +BufFile *
> +BufFileImport(BufFileDescriptor *descriptor)
> +{
...
> +   file->isInterXact = true; /* prevent cleanup by this backend */

There is only one user of isInterXact = true BufFiles at present,
tuplestore.c. It, in turn, only does so for cases that require
persistent tuple stores. A quick audit of these tuplestore.c callers
show this to just be cursor support code within portalmem.c. Here is
the relevant tuplestore_begin_heap() rule that that code adheres to,
unlike the code I've quoted above:

 * interXact: if true, the files used for on-disk storage persist beyond the
 * end of the current transaction.  NOTE: It's the caller's responsibility to
 * create such a tuplestore in a memory context and resource owner that will
 * also survive transaction boundaries, and to ensure the tuplestore is closed
 * when it's no longer wanted.

I don't think it's right for buffile.c to know anything about file
paths directly -- I'd say that that's a modularity violation.
PathNameOpenFile() is called by very few callers at the moment, all of
them very low level (e.g. md.c), but you're using it within buffile.c
to open a path to the file that you obtain from shared memory
directly. This is buggy because the following code won't be reached in
workers that call your BufFileImport() function:

/* Mark it for deletion at close */
VfdCache[file].fdstate |= FD_TEMPORARY;

/* Register it with the current resource owner */
if (!interXact)
{
VfdCache[file].fdstate |= FD_XACT_TEMPORARY;

ResourceOwnerEnlargeFiles(CurrentResourceOwner);
ResourceOwnerRememberFile(CurrentResourceOwner, file);
VfdCache[file].resowner = CurrentResourceOwner;

/* ensure cleanup happens at eoxact */
have_xact_temporary_files = true;
}

Certainly, you don't want the "Mark it for deletion at close" bit.
Deletion should not happen at eoxact for non-owners-but-sharers
(within FileClose()), but you *do* want CleanupTempFiles() to call
FileClose() for the virtual file descriptors you've opened in the
backend, to do some other cleanup. In general, you want to buy into
resource ownership for workers. As things stand, I think that this
will leak virtual file descriptors. That's really well hidden because
there is a similar CleanupTempFiles() call at proc exit, I think.
(Didn't take the time to make sure that that's what masked problems.
I'm sure that you want minimal divergence with serial cases,
resource-ownership-wise, in any case.)

Instead of all this, I suggest copying some of my changes to fd.c, so
that resource ownership within fd.c differentiates between a vfd that
is owned by the backend in the conventional sense, including having a
need to delete at eoxact, as well as a lesser form of ownership where
deletion should not happen. Maybe you'll end up using my BufFileUnify
interface [1] within workers (instead of just within the leader, as
with parallel tuplesort), and have it handle all of that for you.
Currently, that would mean that there'd be an unused/0 sized "local"
segment for the unified BufFile, but I was thinking of making that not
happen unless and until a new segment is actually needed, so even that
minor wart wouldn't necessarily affect you.

> Some assorted notes on the status:  I need to do some thinking about
> the file cleanup logic: both explicit deletes at the earliest possible
> time, and failure/error paths.  Currently the creator of each file is
> responsible for cleaning it up, but I guess if the creator aborts
> early the file disappears underneath the others' feet, and then I
> guess they might raise a confusing error report that races against the
> root cause error report; 

Re: [HACKERS] WIP: [[Parallel] Shared] Hash

2017-01-09 Thread Thomas Munro
On Sat, Jan 7, 2017 at 9:01 AM, Thomas Munro
 wrote:
> On Tue, Jan 3, 2017 at 10:53 PM, Thomas Munro
>  wrote:
>> I will post a new rebased version soon with that and
>> some other nearby problems fixed.
>
> Here is a new WIP patch.

To make this easier to understand and harmonise the logic used in a
few places, I'm now planning to chop it up into a patch series,
probably something like this:

1.  Change existing hash join code to use chunk-based accounting
2.  Change existing hash join code to use a new interface for dealing
with batches
3.  Add shared hash join support, single batch only
4.  Add components for doing shared batch reading (unused)
5.  Add multi-batch shared hash join support

-- 
Thomas Munro
http://www.enterprisedb.com


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] WIP: [[Parallel] Shared] Hash

2017-01-06 Thread Thomas Munro
On Sat, Jan 7, 2017 at 9:01 AM, Thomas Munro
 wrote:
> On Tue, Jan 3, 2017 at 10:53 PM, Thomas Munro
>  wrote:
>> I will post a new rebased version soon with that and
>> some other nearby problems fixed.
>
> Here is a new WIP patch.

I forgot to mention: this applies on top of barrier-v5.patch, over here:

https://www.postgresql.org/message-id/CAEepm%3D3g3EC734kgriWseiJPfUQZeoMWdhAfzOc0ecewAa5uXg%40mail.gmail.com

-- 
Thomas Munro
http://www.enterprisedb.com


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] WIP: [[Parallel] Shared] Hash

2017-01-03 Thread Thomas Munro
On Mon, Jan 2, 2017 at 3:17 PM, Peter Geoghegan  wrote:
> I noticed a bug in your latest revision:
>
>> +   /*
>> +* In HJ_NEED_NEW_OUTER, we already selected the current inner batch for
>> +* reading from.  If there is a shared hash table, we may have already
>> +* partially loaded the hash table in ExecHashJoinPreloadNextBatch.
>> +*/
>> +   Assert(hashtable->batch_reader.batchno = curbatch);
>> +   Assert(hashtable->batch_reader.inner);
>
> Obviously this isn't supposed to be an assignment.

Right, thanks!  I will post a new rebased version soon with that and
some other nearby problems fixed.

-- 
Thomas Munro
http://www.enterprisedb.com


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] WIP: [[Parallel] Shared] Hash

2017-01-01 Thread Peter Geoghegan
On Sat, Dec 31, 2016 at 2:52 AM, Thomas Munro
 wrote:
> Unfortunately it's been a bit trickier than I anticipated to get the
> interprocess batch file sharing and hash table shrinking working
> correctly and I don't yet have a new patch in good enough shape to
> post in time for the January CF.  More soon.

I noticed a bug in your latest revision:

> +   /*
> +* In HJ_NEED_NEW_OUTER, we already selected the current inner batch for
> +* reading from.  If there is a shared hash table, we may have already
> +* partially loaded the hash table in ExecHashJoinPreloadNextBatch.
> +*/
> +   Assert(hashtable->batch_reader.batchno = curbatch);
> +   Assert(hashtable->batch_reader.inner);

Obviously this isn't supposed to be an assignment.

-- 
Peter Geoghegan


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] WIP: [[Parallel] Shared] Hash

2016-12-31 Thread Thomas Munro
On Sat, Dec 3, 2016 at 1:38 AM, Haribabu Kommi  wrote:
> Moved to next CF with "waiting on author" status.

Unfortunately it's been a bit trickier than I anticipated to get the
interprocess batch file sharing and hash table shrinking working
correctly and I don't yet have a new patch in good enough shape to
post in time for the January CF.  More soon.

-- 
Thomas Munro
http://www.enterprisedb.com


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] WIP: [[Parallel] Shared] Hash

2016-12-02 Thread Haribabu Kommi
On Thu, Nov 3, 2016 at 4:19 PM, Thomas Munro 
wrote:

> Obviously I'm actively working on developing and stabilising all this.
> Some of the things I'm working on are: work_mem accounting, batch
> increases, rescans and figuring out if the resource management for
> those BufFiles is going to work.  There are quite a lot of edge cases
> some of which I'm still figuring out, but I feel like this approach is
> workable.  At this stage I want to share what I'm doing to see if
> others have feedback, ideas, blood curdling screams of horror, etc.  I
> will have better patches and a set of test queries soon.  Thanks for
> reading.
>
>
This patch doesn't receive any review. Patch is not applying properly to
HEAD.
Moved to next CF with "waiting on author" status.

Regards,
Hari Babu
Fujitsu Australia


[HACKERS] WIP: [[Parallel] Shared] Hash

2016-10-31 Thread Thomas Munro
Hi hackers,

In PostgreSQL 9.6, hash joins can be parallelised under certain
conditions, but a copy of the hash table is built in every
participating backend.  That means that memory and CPU time are
wasted.  In many cases, that's OK: if the hash table contents are
small and cheap to compute, then we don't really care, we're just
happy that the probing can be done in parallel.  But in cases where
the hash table is large and/or expensive to build, we could do much
better.  I am working on that problem.

To recap the situation in 9.6, a hash join can appear below a Gather
node and it looks much the same as a non-parallel hash join except
that it has a partial outer plan:

  ->  Hash Join
->  
->  Hash
  ->  

A partial plan is one that has some kind of 'scatter' operation as its
ultimate source of tuples.  Currently the only kind of scatter
operation is a Parallel Seq Scan (but see also the Parallel Index Scan
and Parallel Bitmap Scan proposals).  The scatter operation enables
parallelism in all the executor nodes above it, as far as the
enclosing 'gather' operation which must appear somewhere above it.
Currently the only kind of gather operation is a Gather node (but see
also the Gather Merge proposal which adds a new one).

The inner plan is built from a non-partial parallel-safe path and will
be run in every worker.

Note that a Hash Join node in 9.6 isn't parallel-aware itself: it's
not doing anything special at execution time to support parallelism.
The planner has determined that correct partial results will be
produced by this plan, but the executor nodes are blissfully unaware
of parallelism.

PROPOSED NEW PLAN VARIANTS

Shortly I will post a patch which introduces two new hash join plan
variants that are parallel-aware:

1.  Parallel Hash Join with Shared Hash

  ->  Parallel Hash Join
->  
->  Shared Hash
  ->  

In this case, there is only one copy of the hash table and only one
participant loads it.  The other participants wait patiently for one
chosen backend to finish building the hash table, and then they all
wake up and probe.

Call the number of participants P, being the number of workers + 1
(for the leader).  Compared to a non-shared hash plan, we avoid
wasting CPU and IO resources running P copies of the inner plan in
parallel (something that is not well captured in our costing model for
parallel query today), and we can allow ourselves to use a hash table
P times larger while sticking to the same overall space target of
work_mem * P.

2.  Parallel Hash Join with Parallel Shared Hash

  ->  Parallel Hash Join
->  
->  Parallel Shared Hash
  ->  

In this case, the inner plan is run in parallel by all participants.
We have the advantages of a shared hash table as described above, and
now we can also divide the work of running the inner plan and hashing
the resulting tuples by P participants.  Note that Parallel Shared
Hash is acting as a special kind of gather operation that is the
counterpart to the scatter operation contained in the inner plan.

PERFORMANCE

So far I have been unable to measure any performance degradation
compared with unpatched master for hash joins with non-shared hash.
That's good because it means that I didn't slow existing plans down
when I introduced a bunch of conditional branches to existing hash
join code.

Laptop testing shows greater than 2x speedups on several of the TPC-H
queries with single batches, and no slowdowns.  I will post test
numbers on big rig hardware in the coming weeks when I have the
batching code in more complete and stable shape.

IMPLEMENTATION

I have taken the approach of extending the existing hash join
algorithm, rather than introducing separate hash join executor nodes
or a fundamentally different algorithm.  Here's a short description of
what the patch does:

1.  SHARED HASH TABLE

To share data between participants, the patch uses two other patches I
have proposed:  DSA areas[1], which provide a higher level interface
to DSM segments to make programming with processes a little more like
programming with threads, and in particular a per-parallel-query DSA
area[2] that is made available for any executor node that needs some
shared work space.

The patch uses atomic operations to push tuples into the hash table
buckets while building, rehashing and loading, and then the hash table
is immutable during probing (except for match flags used to implement
outer joins).  The existing memory chunk design is retained for dense
allocation of tuples, which provides a convenient way to rehash the
table when its size changes.

2.  WORK COORDINATION

To coordinate parallel work, this patch uses two other patches:
barriers[3], to implement a 'barrier' or 'phaser' synchronisation
primitive, and those in turn use the condition variables proposed by
Robert Haas.

Barriers provide a way for participants to break work up into