Re: [HACKERS] Parallel Hash take II

2017-12-21 Thread Andres Freund
Hi,

> > Not really happy with the name. ExecParallelHashTableInsert() calling
> > ExecParallelHashLoadTuple() to insert a tuple into the hashtable doesn't
> > quite strike me as right; the naming similarity to
> > ExecParallelHashTableLoad seems problematic too.
> > ExecParallelHashAllocTuple() or such?
> >
> > One could argue it'd not be a bad idea to keep a similar split as
> > dense_alloc() and memcpy() have, but I'm not really convinced by
> > myself. Hm.
>
> Yeah, the names are confusing.  So:

Cool.


> > Just to make sure I understand: The "empty" phase is to protect the
> > process of the electee doing the sizing calculations etc?  And the
> > reason that's not possible to do just by waiting for
> > PHJ_GROW_BATCHES_REPARTITIONING is that somebody could dynamically
> > arrive, i.e. it'd not be needed in a statically sized barrier?
>
> Yeah, it's where you wait for the serial phase above to be finished.
> [ Explanation ] Does that make sense?

Yes.


>
> > +   /* reset temp memory each time to avoid leaks from qual expr */
> > +   ResetExprContext(econtext);
> > +
> > +   if (ExecQual(hjclauses, econtext))
> >
> > I personally think it's better to avoid this pattern and store the
> > result of the ExecQual() in a variable, ResetExprContext() and then act
> > on the result.  No need to keep memory around for longer, and for bigger
> > contexts you're more likely to have all the metadata in cache.
> >
> > I'd previously thought about introducing ExecQualAndReset() or such...
>
> Makes sense, but this is code that is identical in
> ExecScanHashBucket() so I think we should keep it this way for now,
> and explore expression context lifetime improvements in a separate
> patch?  Looks like the same change could be made in other nodes too.

Ok.


> > +
> > +   /*
> > +* If we started up so late that the shared batches have been freed
> > +* already by ExecHashTableDetach(), then we are finished.
> > +*/
> > +   if (!DsaPointerIsValid(hashtable->parallel_state->batches))
> > +   return false;
> >
> > This is really the only place that weird condition is detected? And why
> > is that possible in the first place? And if possible, shouldn't we have
> > detected earlier?  Also, if possible, what prevents this to occur in a
> > way that test fails, because pstate->batches is freed, but not yet
> > reset?
>
> ExecParallelHashJoinNewBatch(), where this code appears, is generally
> the place that we discover that we're finished.  Normally we discover
> that we're finished by seeing that there are no batches left to chew
> on, by inspecting the per-batch state in shmem.  This weird condition
> arises when a worker starts up so late that the join is finished and
> the shmem space used to tracks batches has already been freed.  I
> agree that that was badly explained and there was in fact something a
> bit kooky about that coding.  I have now changed it so that
> ExecParallelHashEnsureBatchAccessors() detects this case and has a
> better comment to explain it, and ExecParallelHashJoinNewBatch() now
> just looks out for hashtable->batches == NULL with a comment referring
> to the other place.

Thanks for updating.


My compiler complained that ExecHashJoinImpl() might not be
inlinable. That's just because you declared it always_inline without
actually making it an inline function...


Pushed.  Yeha!  Congrats, this has been quite the project.


I suspect we'll find a bunch of problems, both on the planning and
execution side, but I think at this point we're much more likely to find
and resolve these in-tree vs. out of tree.



Btw, I see dsa_get_address() show up pretty prominently in profiles. I
kinda wonder if there's some cases where we could ameliorate the cost by
recognizing that a bunch of lookups are all going to reside in the same
segment.
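
To sketch that idea (untested, and everything here except dsa_get_address(),
dsa_area and dsa_pointer is made up for illustration): when a run of objects
is known to live inside one chunk obtained from a single dsa_allocate() call,
the base could be translated once and the rest addressed by plain offsets:

    static void
    process_chunk(dsa_area *area, dsa_pointer chunk_dp,
                  size_t ntuples, size_t tuple_stride)
    {
        /* One translation for the whole chunk. */
        char       *base = dsa_get_address(area, chunk_dp);
        size_t      i;

        for (i = 0; i < ntuples; i++)
        {
            /* Tuples found by offset, no further dsa_get_address() calls. */
            void       *tuple = base + i * tuple_stride;

            (void) tuple;       /* ... process tuple ... */
        }
    }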

- Andres



Re: [HACKERS] Parallel Hash take II

2017-12-18 Thread Andres Freund
On 2017-12-14 21:06:34 +1300, Thomas Munro wrote:
> On Thu, Dec 14, 2017 at 11:45 AM, Andres Freund  wrote:
> > +   if (accessor->write_pointer + accessor->sts->meta_data_size >=
> > +   accessor->write_end)
> > +   elog(ERROR, "meta-data too long");
> >
> > That seems more like an Assert than a proper elog? Given that we're
> > calculating size just a few lines above...
> 
> It's an error because the logic is not smart enough to split the
> optional meta-data and tuple size over multiple chunks.  I have added
> comments there to explain.

I don't see how that requires it to be an elog rather than an assert. As
far as I can tell this is only reachable if
meta_data_size + sizeof(uint32) >= STS_CHUNK_DATA_SIZE. For anything
smaller the sts_flush_chunk() a few lines above would have started a new
chunk.

IOW, we should detect this at sts_initialize() time, elog there, and
Assert() out in puttuple.

I've changed the code like that, fixed my own < vs >= confusion during
the conversion, fixed a typo, and pushed.  If you're not ok with that
change, we can easily whack this around.
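
To illustrate the intended split (simplified fragments, not the exact
committed code):

    /* In sts_initialize(): reject impossible meta-data sizes up front. */
    if (meta_data_size + sizeof(uint32) >= STS_CHUNK_DATA_SIZE)
        elog(ERROR, "meta-data too long");

    /* In sts_puttuple(): reaching the old condition now indicates a bug. */
    Assert(accessor->write_pointer + accessor->sts->meta_data_size <
           accessor->write_end);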

I've not changed this, but I wonder whether we should rename
sts_puttuple() to sts_put_tuple(), for consistency with
sts_read_tuple(). Or the other way round. Or...

Greetings,

Andres Freund



Re: [HACKERS] Parallel Hash take II

2017-12-14 Thread Andres Freund
Hi,

Looking at the main patch (v28).

First off: This looks pretty good, the code's quite readable now
(especially compared to earlier versions), the comments are good. Really
like the nodeHash split, and the always inline hackery in nodeHashjoin.
Think we're getting really really close.

  * ExecHashJoinImpl
  *
- * This function implements the Hybrid Hashjoin algorithm.
+ * This function implements the Hybrid Hashjoin algorithm.  By forcing it
+ * to be always inline many compilers are able to specialize it for
+ * parallel = true/false without repeating the code.
  *

what about adding the above explanation for the always inline?


+   /*
+* So far we have no idea whether there are any other participants,
+* and if so, what phase they are working on.  The only thing we care
+* about at this point is whether someone has already created the
+* SharedHashJoinBatch objects, the main hash table for batch 0 and
+* (if necessary) the skew hash table yet.  One backend will be
+* elected to do that now if necessary.
+*/

The 'yet' sounds a bit weird in combination with the 'already'.


+ static void
+ ExecParallelHashIncreaseNumBatches(HashJoinTable hashtable)
+ ...
+   case PHJ_GROW_BATCHES_ELECTING:
+   /* Elect one participant to prepare the operation. */

That's a good chunk of code. I'm ever so slightly inclined to put that
into a separate function. But I'm not sure it'd look good. Feel entirely
free to disregard.


+ static HashJoinTuple
+ ExecParallelHashLoadTuple(HashJoinTable hashtable, MinimalTuple tuple,
+ dsa_pointer *shared)

Not really happy with the name. ExecParallelHashTableInsert() calling
ExecParallelHashLoadTuple() to insert a tuple into the hashtable doesn't
quite strike me as right; the naming similarity to
ExecParallelHashTableLoad seems problematic too.
ExecParallelHashAllocTuple() or such?

One could argue it'd not be a bad idea to keep a similar split as
dense_alloc() and memcpy() have, but I'm not really convinced by
myself. Hm.


+   case PHJ_GROW_BATCHES_ELECTING:
+   /* Elect one participant to prepare the operation. */

Imo that comment could use a one-line summary of what preparing means.


+   /*
+* We probably also need a smaller bucket array.  How many
+* tuples do we expect per batch, assuming we have only
+* half of them so far?

Makes sense, but did cost me a minute of thinking. Maybe add a short
explanation why.


+   case PHJ_GROW_BATCHES_ALLOCATING:
+   /* Wait for the above to be finished. */
+   BarrierArriveAndWait(&pstate->grow_batches_barrier,
+                        WAIT_EVENT_HASH_GROW_BATCHES_ALLOCATING);
+   /* Fall through. */

Just to make sure I understand: The "empty" phase is to protect the
process of the electee doing the sizing calculations etc?  And the
reason that's not possible to do just by waiting for
PHJ_GROW_BATCHES_REPARTITIONING is that somebody could dynamically
arrive, i.e. it'd not be needed in a statically sized barrier?  Pretty
tired here, sorry ;)


+   /* reset temp memory each time to avoid leaks from qual expr */
+   ResetExprContext(econtext);
+
+   if (ExecQual(hjclauses, econtext))

I personally think it's better to avoid this pattern and store the
result of the ExecQual() in a variable, ResetExprContext() and then act
on the result.  No need to keep memory around for longer, and for bigger
contexts you're more likely to have all the metadata in cache.

I'd previously thought about introducing ExecQualAndReset() or such...
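
A minimal sketch of such a helper, just to show the shape (not part of this
patch):

    static inline bool
    ExecQualAndReset(ExprState *state, ExprContext *econtext)
    {
        bool        ret = ExecQual(state, econtext);

        /* Release anything qual evaluation leaked into the per-tuple context. */
        ResetExprContext(econtext);
        return ret;
    }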


  * IDENTIFICATION
  *   src/backend/executor/nodeHashjoin.c
  *
+ * PARALLELISM
+ *

This is a pretty good explanation. How about adding a reference to it
from nodeHash.c's header?



+static TupleTableSlot *        /* return: a tuple or NULL */
+ExecHashJoin(PlanState *pstate)
+{
+   /*
+* On sufficiently smart compilers this should be inlined with the
+* parallel-aware branches removed.
+*/
+   return ExecHashJoinImpl(pstate, false);

Ah, the explanation I desired above is here. Still seems good to have a
comment at the somewhat suspicious use of always_inline.


+
+   /*
+* If we started up so late that the shared batches have been freed
+* already by ExecHashTableDetach(), then we are finished.
+*/
+   if (!DsaPointerIsValid(hashtable->parallel_state->batches))
+   return false;

This is really the only place that weird condition is detected? And why
is that possible in the first place? And if possible, shouldn't we have
detected earlier?  Also, if possible, what prevents this to occur in a
way that test fails, because pstate->batches is freed, but not yet
reset?

Re: [HACKERS] Parallel Hash take II

2017-12-14 Thread Thomas Munro
On Thu, Dec 14, 2017 at 11:45 AM, Andres Freund  wrote:
> +   bool    overflow;       /* Continuation of previous chunk? */
> +   char    data[FLEXIBLE_ARRAY_MEMBER];
> +} SharedTuplestoreChunk;
>
> Ah. I was thinking we could have the 'overflow' variable be an int,
> indicating the remaining length of the oversized tuple. That'd allow us
> to skip ahead to the end of the oversized tuple in concurrent processes
> after hitting it.

Right, that is a bit better as it avoids extra read-skip cycles for
multi-overflow-chunk cases.  Done that way.
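
Roughly the shape of that change, as a sketch with simplified names and
assuming "overflow" counts the chunks of the oversized tuple that remain,
including the current one:

    typedef struct SharedTuplestoreChunk
    {
        int     ntuples;    /* Number of tuples in this chunk. */
        int     overflow;   /* 0 for normal chunks; for continuation chunks,
                             * the number of overflow chunks remaining,
                             * including this one. */
        char    data[FLEXIBLE_ARRAY_MEMBER];
    } SharedTuplestoreChunk;

    /*
     * Reader side: if the chunk just claimed turns out to be a continuation,
     * re-take the lock and push the shared read position past the whole
     * oversized tuple instead of visiting each overflow chunk in turn.
     */
    if (chunk_header.overflow > 0)
        p->read_page = Max(p->read_page,
                           read_page + chunk_header.overflow * STS_CHUNK_PAGES);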

> +   if (accessor->write_pointer + accessor->sts->meta_data_size >=
> +   accessor->write_end)
> +   elog(ERROR, "meta-data too long");
>
> That seems more like an Assert than a proper elog? Given that we're
> calculating size just a few lines above...

It's an error because the logic is not smart enough to split the
optional meta-data and tuple size over multiple chunks.  I have added
comments there to explain.  That error can be reached by CALL
test_sharedtuplestore(1, 1, 1, 32756, 1), but 32755 is OK.  My goal
here is to support arbitrarily large tuples, not arbitrarily large
per-tuple meta-data, since for my use case I only need 4 bytes (a hash
value).  This could be improved if required by later features
(probably anyone wanting more general meta-data would want variable
sized meta-data anyway, whereas this is fixed, and it would also be
nice if oversized tuples didn't have to start at the beginning of a
new chunk).

I fixed two nearby fencepost bugs: I made the limit that triggers that
error smaller by sizeof(uint32) and fixed a problem when small tuples
appear after an oversize tuple in a final overflow chunk (found by
hacking the test module to create mixtures of different sized tuples).

> +   if (accessor->sts->meta_data_size > 0)
> +   memcpy(accessor->write_pointer, meta_data,
> +  accessor->sts->meta_data_size);
> +   written = accessor->write_end - accessor->write_pointer -
> +       accessor->sts->meta_data_size;
> +   memcpy(accessor->write_pointer + accessor->sts->meta_data_size,
> +  tuple, written);
>
> Also, shouldn't the same Assert() be here as well if you have it above?

No, when it comes to the tuple we just write as much of it as will
fit, and write the rest in the loop below.  Comments improved to make
that clear.

> +   ereport(ERROR,
> +   (errcode_for_file_access(),
> +            errmsg("could not read from shared tuplestore temporary file"),
> +            errdetail("Short read while reading meta-data")));
>
> The errdetail doesn't follow the style guide (not a sentence ending with
> .), and seems internal-ish. I'm ok with keeping it, but perhaps we
> should change all these to be errdetail_internal()? Seems pointless to
> translate all of them.

Done.

> +   LWLockAcquire(&p->lock, LW_EXCLUSIVE);
> +   eof = p->read_page >= p->npages;
> +   if (!eof)
> +   {
> +   read_page = p->read_page;
> +   p->read_page += STS_CHUNK_PAGES;
> +   }
> +   LWLockRelease(&p->lock);
>
> So if we went to the world I'm suggesting, with overflow containing the
> length till the end of the tuple, this'd probably would have to look a
> bit different.

Yeah.  I almost wanted to change it back to a spinlock but now it's
grown bigger again...

> +   if (!eof)
> +   {
> +   SharedTuplestoreChunk chunk_header;
> +
> +   /* Make sure we have the file open. */
> +   if (accessor->read_file == NULL)
> +   {
> +   char    name[MAXPGPATH];
> +
> +   sts_filename(name, accessor, accessor->read_participant);
> +   accessor->read_file =
> +       BufFileOpenShared(accessor->fileset, name);
> +   if (accessor->read_file == NULL)
> +       elog(ERROR, "could not open temporary file %s", name);
>
> Isn't this more an Assert or just not anything? There's now way
> BufFileOpenShared should ever return NULL, no?

Right.  As of commit 923e8dee this can no longer return NULL (instead
it would raise an error), so I removed this redundant check.

> +   /* If this is an overflow chunk, we skip it. */
> +   if (chunk_header.overflow)
> +   continue;
> +
> +   accessor->read_ntuples 

Re: [HACKERS] Parallel Hash take II

2017-12-13 Thread Andres Freund
Hi,

Looking at the latest version of the tuplestore patch:


diff --git a/src/backend/utils/sort/sharedtuplestore.c b/src/backend/utils/sort/sharedtuplestore.c
new file mode 100644
index 000..d1233221a58
--- /dev/null
+++ b/src/backend/utils/sort/sharedtuplestore.c
@@ -0,0 +1,583 @@
+/*-
+ *
+ * sharedtuplestore.c
+ *   Simple mechanism for sharing tuples between backends.
+ *
+ * This module provides a shared temporary tuple storage mechanism, providing
+ * a parallel-aware subset of the features of tuplestore.c.  Multiple backends
+ * can write to a SharedTuplestore, and then multiple backends can later scan
+ * the stored tuples.  Currently, the only scan type supported is a parallel
+ * scan where each backend reads an arbitrary subset of the tuples that were
+ * written.

Cool.


+/* Chunk written to disk. */
+typedef struct SharedTuplestoreChunk
+{
+   int     ntuples;        /* Number of tuples in this chunk. */
+   bool    overflow;       /* Continuation of previous chunk? */
+   char    data[FLEXIBLE_ARRAY_MEMBER];
+} SharedTuplestoreChunk;

Ah. I was thinking we could have the 'overflow' variable be an int,
indicating the remaining length of the oversized tuple. That'd allow us
to skip ahead to the end of the oversized tuple in concurrent processes
after hitting it.



+/*
+ * Write a tuple.  If a meta-data size was provided to sts_initialize, then a
+ * pointer to meta data of that size must be provided.
+ */
+void
+sts_puttuple(SharedTuplestoreAccessor *accessor, void *meta_data,
+MinimalTuple tuple)
+{
+   size_t  size;
+
+   /* Do we have our own file yet? */
+   if (accessor->write_file == NULL)
+   {
+   SharedTuplestoreParticipant *participant;
+   char    name[MAXPGPATH];
+
+   /* Create one.  Only this backend will write into it. */
+   sts_filename(name, accessor, accessor->participant);
+   accessor->write_file = BufFileCreateShared(accessor->fileset, name);
+
+   /* Set up the shared state for this backend's file. */
+   participant = &accessor->sts->participants[accessor->participant];
+   participant->writing = true;    /* for assertions only */
+   }
+
+   /* Do we have space? */
+   size = accessor->sts->meta_data_size + tuple->t_len;
+   if (accessor->write_pointer + size >= accessor->write_end)
+   {
+   if (accessor->write_chunk == NULL)
+   {
+   /* First time through.  Allocate chunk. */
+   accessor->write_chunk = (SharedTuplestoreChunk *)
+       MemoryContextAllocZero(accessor->context,
+                              STS_CHUNK_PAGES * BLCKSZ);
+   accessor->write_chunk->ntuples = 0;
+   accessor->write_pointer = &accessor->write_chunk->data[0];
+   accessor->write_end = (char *)
+       accessor->write_chunk + STS_CHUNK_PAGES * BLCKSZ;
+   }
+   else
+   {
+   /* See if flushing helps. */
+   sts_flush_chunk(accessor);
+   }
+
+   /* It may still not be enough in the case of a gigantic tuple. */
+   if (accessor->write_pointer + size >= accessor->write_end)
+   {
+   size_t  written;
+
+   /*
+    * We'll write the beginning of the oversized tuple, and then
+    * write the rest in some number of 'overflow' chunks.
+    */
+   if (accessor->write_pointer + accessor->sts->meta_data_size >=
+       accessor->write_end)
+       elog(ERROR, "meta-data too long");

That seems more like an Assert than a proper elog? Given that we're
calculating size just a few lines above...


+   if (accessor->sts->meta_data_size > 0)
+   memcpy(accessor->write_pointer, meta_data,
+  accessor->sts->meta_data_size);
+   written = accessor->write_end - accessor->write_pointer -
+       accessor->sts->meta_data_size;
+   memcpy(accessor->write_pointer + accessor->sts->meta_data_size,
+  tuple, written);

Also, shouldn't the same Assert() be here as well if you have it above?

+static MinimalTuple
+sts_read_tuple(SharedTuplestoreAccessor *accessor, void *meta_data)
+{
+   MinimalTuple tuple;
+   uint32  size;
+   size_t  remaining_size;
+   size_t  

Re: [HACKERS] Parallel Hash take II

2017-12-12 Thread Thomas Munro
On Sat, Dec 2, 2017 at 3:46 PM, Andres Freund  wrote:
> Looking at 0004-Add-shared-tuplestores.patch
>
> Comments:
> - I'd rename mutex to lock. Seems quite possible that we end up with shared
>   lockers too.

Done.

> - Expand "Simple mechanism for sharing tuples between backends." intro
>   comment - that doesn't really seem like a meaningful description of
>   the mechanism. Should probably mention that it's similar to
>   tuplestores etc...

Done.

> - I'm still concerned about the chunking mechanism. How about this
>   sketch of an alternative solution?
>
>   Chunks are always the same length. To avoid having to read the length
>   from disk while holding a lock, introduce continuation chunks which
>   store the amount of space needed by the overlarge tuple at the
>   start. The reading process stays largely the same, except that if a
>   backend reads a chunk that's a continuation, it reads the length of
>   the continuation and skips ahead. That could mean that more than one
>   backend reads continuation chunks, but the window is small and there's
>   normally not going to be that many huge tuples (otherwise things are
>   slow anyway).

Done.

I've also included a simple test harness that can be used to drive
SharedTuplestore independently of Parallel Hash, but that patch is not
for commit.  See example of usage in the commit message.
(Incidentally I noticed that ParallelWorkerNumber is not marked
PGDLLIMPORT so that fails to build on Windows CI.)

> - why are we using a separate hardcoded 32 for sts names? Why not just
>   go for NAMEDATALEN or such?

Done.

> - I'd replace most of the "should's" in comments with "need".

Done.

Another problem I discovered is that v27's way of installing a
different function into ExecProcNode in ExecInitHashJoin() was broken:
it didn't allow for the possibility that there is no DSA area
available due to lack of resources.  ExecInitNode() is too soon to
decide.  My solution is to provide a way for executor nodes to change
their ExecProcNode functions at any later time, which requires a way
for execProcnode.c to redo any wrapper functions.
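
The hook itself is small; something along these lines (sketch):

    /*
     * Let a node install a different ExecProcNode callback after
     * ExecInitNode() has run, re-applying the first-call wrapper so that
     * stack depth checks and instrumentation get set up again for the new
     * function.
     */
    void
    ExecSetExecProcNode(PlanState *node, ExecProcNodeMtd function)
    {
        node->ExecProcNodeReal = function;
        node->ExecProcNode = ExecProcNodeFirst;
    }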

-- 
Thomas Munro
http://www.enterprisedb.com


parallel-hash-v28.patchset.tgz
Description: GNU Zip compressed data


Re: [HACKERS] Parallel Hash take II

2017-12-07 Thread Andres Freund
On 2017-12-08 12:07:04 +1300, Thomas Munro wrote:
> I suppose if we wanted to make this type of problem go away, but still
> keep files for forensic purposes, we could introduce a "restart
> number", and stick it into the top level temporary directory's name.
> That way you'd never be able to collide with files created before a
> crash-restart, and we could add O_EXCL so it'd become an error to try
> to create the same filename again.

I'm deeply unconvinced by the "forensic" argument to not do temp file
cleanup after crash restarts. That causes problems like the one we're
debating upthread in the first place, so I'm wholly unconvinced we
should add to that further by adding another layer of complexity.

My personal opinion is that we should just do temp file cleanup after
crash restarts, and document restart_after_crash = false as the solution
for investigating crashes.  I don't want to hold up this patch with a
discussion of that however, so I'm ok with your fix.

Greetings,

Andres Freund



Re: [HACKERS] Parallel Hash take II

2017-12-07 Thread Thomas Munro
On Fri, Dec 8, 2017 at 12:07 PM, Thomas Munro
 wrote:
> I suppose if we wanted to make this type of problem go away, but still
> keep files for forensic purposes, we could introduce a "restart
> number", and stick it into the top level temporary directory's name.
> That way you'd never be able to collide with files created before a
> crash-restart, and we could add O_EXCL so it'd become an error to try
> to create the same filename again.

Or we could teach crash-restart to move the top level directory (in
each tablespace) to pgsql_tmp.old, so we'd keep the temporary files
from one previous lifetime only.  That'd prevent unlimited space
eating in multiple crash scenarios, and we could more comfortably say
that it's entirely safe to delete that directory manually in cases
like this:

https://www.postgresql.org/message-id/flat/4f3c89a224ff4660baa62a2b79fb0f1d%40ITUPW-EXMBOX3B.UniNet.unisa.edu.au

-- 
Thomas Munro
http://www.enterprisedb.com



Re: [HACKERS] Parallel Hash take II

2017-12-07 Thread Thomas Munro
On Sat, Dec 2, 2017 at 4:46 PM, Thomas Munro
 wrote:
> On Sat, Dec 2, 2017 at 3:54 PM, Thomas Munro
>  wrote:
>> On Sat, Dec 2, 2017 at 1:55 PM, Andres Freund  wrote:
>>> - As we don't clean temp files after crash-restarts it isn't impossible
>>>   to have a bunch of crash-restarts and end up with pids *and* per-pid
>>>   shared file set counters reused. Which'd lead to conflicts. Do we care?
>>
>> We don't care.  PathNameCreateTemporaryDir() on a path that already
>> exists will return silently.  PathNameCreateTemporaryFile() on a path
>> that already exists, as mentioned in a comment and following an
>> existing convention, will open and truncate it.  So either it was
>> really an orphan and that is a continuation of our traditional
>> recycling behaviour, or the calling code has a bug and used the same
>> path twice and it's not going to end well.
>
> On further reflection, there are problems with that higher up.  (1)
> Even though PathNameCreateTemporaryFile() will happily truncate and
> reuse an orphaned file when BufFileCreateShared() calls it,
> BufFileOpenShared() could get confused by the orphaned files.  It
> could believe that XXX.1 is a continuation of XXX.0, when in fact it
> is junk left over from a crash restart.  Perhaps BufFileCreateShared()
> needs to delete XXX.{N+1} if it exists, whenever it creates XXX.{N}.
> That will create a gap in the series of existing files that will cause
> BufFileOpenShared()'s search to terminate.  (2) BufFileOpenShared()
> thinks that the absence of an XXX.0 file means there is no BufFile by
> this name, when it could mistakenly open pre-crash junk due to a
> colliding name.  I use that condition as information, but I think I
> can fix that easily by using SharedTuplestoreParticipant::npage == 0
> to detect that there should be no file, rather than trying to open it,
> and then I can define this problem away by declaring that
> BufFileOpenShared() on a name that you didn't call
> BufFileCreateShared() on invokes undefined behaviour.  I will make it
> so.

Here is a patch to deal with that problem.  Thoughts?
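
The core of it is a defensive unlink when creating a shared BufFile segment,
roughly like this (helper names abbreviated; error handling omitted):

    /*
     * When creating segment N, remove any leftover segment N + 1 from a
     * previous crash-restart, so that BufFileOpenShared()'s search for the
     * end of the file terminates at the gap.
     */
    SharedSegmentName(name, buffile->name, segment + 1);
    PathNameDeleteTemporaryFile(name, false);

    SharedSegmentName(name, buffile->name, segment);
    file = PathNameCreateTemporaryFile(name, true);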

I suppose if we wanted to make this type of problem go away, but still
keep files for forensic purposes, we could introduce a "restart
number", and stick it into the top level temporary directory's name.
That way you'd never be able to collide with files created before a
crash-restart, and we could add O_EXCL so it'd become an error to try
to create the same filename again.

I'll post a new SharedTuplestore and Parallel Hash patch set shortly.

-- 
Thomas Munro
http://www.enterprisedb.com


0001-Add-defenses-against-pre-crash-files-to-BufFileOpenS.patch
Description: Binary data


Re: [HACKERS] Parallel Hash take II

2017-12-02 Thread Andres Freund
On 2017-12-02 15:54:29 +1300, Thomas Munro wrote:
> On Sat, Dec 2, 2017 at 1:55 PM, Andres Freund  wrote:
> > - Right now RemovePgTempFilesInDir() will recurse into appropriately
> >   named directories, and when it recurses it doesn't require the same
> >   name pattern checks. I think that's good, but I think it'd be prudent
> >   to be a bit more paranoid and prevent recursing into symlinked
> >   subdirectories.
> 
> That's why it uses lstat(), so that it sees symlinks rather than what
> they point to. It only recurses if S_ISDIR(), and it unlinks anything
> else.

Right. I'd somehow confused myself by thinking one'd need an explicit
S_ISLINK check...


> Just a reminder: a couple of problems have come up recently in the
> Parallel Hash Join patch itself, so please don't consider that one
> ready for commit quite yet.  They are: (1) Handling the case where
> there is no DSA area because we're running a parallel-aware plan in
> non-parallel mode due to lack of resources; (2) Investigating a rare
> assertion failure.  For (1), that may depend on another patch that
> I'll post shortly to kill "es_query_dsa" and, come to think of it, for
> (2) it's possible that the problem is in either one of the remaining
> patches -- SharedTuplestore or Parallel Hash Join -- so please hold
> off on committing either of those until I've got to the bottom of
> that.

I'm a bit tempted to press ahead regardless of these issues. With your
consent obviously. ISTM we're pretty close to the point where this needs
to be exposed more widely and that'll surely bring more issues to light.

Greetings,

Andres Freund



Re: [HACKERS] Parallel Hash take II

2017-12-01 Thread Thomas Munro
On Sat, Dec 2, 2017 at 3:54 PM, Thomas Munro
 wrote:
> On Sat, Dec 2, 2017 at 1:55 PM, Andres Freund  wrote:
>> - As we don't clean temp files after crash-restarts it isn't impossible
>>   to have a bunch of crash-restarts and end up with pids *and* per-pid
>>   shared file set counters reused. Which'd lead to conflicts. Do we care?
>
> We don't care.  PathNameCreateTemporaryDir() on a path that already
> exists will return silently.  PathNameCreateTemporaryFile() on a path
> that already exists, as mentioned in a comment and following an
> existing convention, will open and truncate it.  So either it was
> really an orphan and that is a continuation of our traditional
> recycling behaviour, or the calling code has a bug and used the same
> path twice and it's not going to end well.

On further reflection, there are problems with that higher up.  (1)
Even though PathNameCreateTemporaryFile() will happily truncate and
reuse an orphaned file when BufFileCreateShared() calls it,
BufFileOpenShared() could get confused by the orphaned files.  It
could believe that XXX.1 is a continuation of XXX.0, when in fact it
is junk left over from a crash restart.  Perhaps BufFileCreateShared()
needs to delete XXX.{N+1} if it exists, whenever it creates XXX.{N}.
That will create a gap in the series of existing files that will cause
BufFileOpenShared()'s search to terminate.  (2) BufFileOpenShared()
thinks that the absence of an XXX.0 file means there is no BufFile by
this name, when it could mistakenly open pre-crash junk due to a
colliding name.  I use that condition as information, but I think I
can fix that easily by using SharedTuplestoreParticipant::npage == 0
to detect that there should be no file, rather than trying to open it,
and then I can define this problem away by declaring that
BufFileOpenShared() on a name that you didn't call
BufFileCreateShared() on invokes undefined behaviour.  I will make it
so.

-- 
Thomas Munro
http://www.enterprisedb.com



Re: [HACKERS] Parallel Hash take II

2017-12-01 Thread Andres Freund
On 2017-11-27 22:25:12 +1300, Thomas Munro wrote:
> On Thu, Nov 23, 2017 at 12:36 AM, Thomas Munro
>  wrote:
> > Here's a new patch set with responses to the last batch of review comments.

Looking at 0004-Add-shared-tuplestores.patch

Comments:
- I'd rename mutex to lock. Seems quite possible that we end up with shared
  lockers too.
- Expand "Simple mechanism for sharing tuples between backends." intro
  comment - that doesn't really seem like a meaningful description of
  the mechanism. Should probably mention that it's similar to
  tuplestores etc...
- I'm still concerned about the chunking mechanism. How about this
  sketch of an alternative solution?

  Chunks are always the same length. To avoid having to read the length
  from disk while holding a lock, introduce continuation chunks which
  store the amount of space needed by the overlarge tuple at the
  start. The reading process stays largely the same, except that if a
  backend reads a chunk that's a continuation, it reads the length of
  the continuation and skips ahead. That could mean that more than one
  backend reads continuation chunks, but the window is small and there's
  normally not going to be that many huge tuples (otherwise things are
  slow anyway).
- why are we using a separate hardcoded 32 for sts names? Why not just
  go for NAMEDATALEN or such?
- I'd replace most of the "should's" in comments with "need".

Greetings,

Andres Freund



Re: [HACKERS] Parallel Hash take II

2017-12-01 Thread Thomas Munro
On Sat, Dec 2, 2017 at 1:55 PM, Andres Freund  wrote:
> Pushed 0003-Add-infrastructure-for-sharing-temporary-files-betwe.patch
> after minor adjustments (some including conversing with you).

Thank you!

> Questions:
> - Right now RemovePgTempFilesInDir() will recurse into appropriately
>   named directories, and when it recurses it doesn't require the same
>   name pattern checks. I think that's good, but I think it'd be prudent
>   to be a bit more paranoid and prevent recursing into symlinked
>   subdirectories.

That's why it uses lstat(), so that it sees symlinks rather than what
they point to. It only recurses if S_ISDIR(), and it unlinks anything
else.  That means that it unlinks any symlinks rather than (say)
following them and deleting stuff outside the temporary directory
tree.  Example:

$ mkdir /tmp/foo
$ touch /tmp/foo/canary
$ mkdir -p $PGDATA/base/pgsql_tmp/pgsql_tmpXXX/bar
$ ln -s /tmp/foo $PGDATA/base/pgsql_tmp/pgsql_tmpXXX/foo
$ ls $PGDATA/base/pgsql_tmp/pgsql_tmpXXX
bar foo
$ postgres/bin/postgres -D $PGDATA
... ^C ...
$ ls $PGDATA/base/pgsql_tmp
$ ls /tmp/foo
canary

Make sense?

> - As we don't clean temp files after crash-restarts it isn't impossible
>   to have a bunch of crash-restarts and end up with pids *and* per-pid
>   shared file set counters reused. Which'd lead to conflicts. Do we care?

We don't care.  PathNameCreateTemporaryDir() on a path that already
exists will return silently.  PathNameCreateTemporaryFile() on a path
that already exists, as mentioned in a comment and following an
existing convention, will open and truncate it.  So either it was
really an orphan and that is a continuation of our traditional
recycling behaviour, or the calling code has a bug and used the same
path twice and it's not going to end well.

Another type of collision we could have in theory is like this:  One
backend creates a SharedFileSet, and various other backends attach to
it.  The creator detaches and exits.  Later another process is created
with the same PID, and creates a new SharedFileSet, and happens to
have the same counter number, but the original SharedFileSet is still
in existence because there are still running backends attached to it.
Then things get ugly.  But this seems improbable, at least as long as
the usage pattern is that leader processes are the only ones creating
SharedFileSet objects and outlive their parallel workers in non-crash
outcomes, but if you think that's too flimsy I suppose it could be
fixed by replacing the per-backend counter variable with a
cluster-wide atomic counter.  Then the PID component would be
redundant but I'd suggest keeping it anyway for its diagnostic value.
Thoughts?
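
(If we went that way, it could be as simple as the following, assuming a
small struct carved out of shared memory at startup; the names here are
hypothetical:)

    /* Hypothetical shared state, initialized at postmaster startup. */
    typedef struct SharedFileSetGlobal
    {
        pg_atomic_uint32    counter;    /* cluster-wide fileset numbers */
    } SharedFileSetGlobal;

    /* In SharedFileSetInit(), replacing the per-backend static counter: */
    fileset->number = pg_atomic_fetch_add_u32(&SharedFileSetGlobalState->counter, 1);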

Just a reminder: a couple of problems have come up recently in the
Parallel Hash Join patch itself, so please don't consider that one
ready for commit quite yet.  They are: (1) Handling the case where
there is no DSA area because we're running a parallel-aware plan in
non-parallel mode due to lack of resources; (2) Investigating a rare
assertion failure.  For (1), that may depend on another patch that
I'll post shortly to kill "es_query_dsa" and, come to think of it, for
(2) it's possible that the problem is in either one of the remaining
patches -- SharedTuplestore or Parallel Hash Join -- so please hold
off on committing either of those until I've got to the bottom of
that.

-- 
Thomas Munro
http://www.enterprisedb.com



Re: [HACKERS] Parallel Hash take II

2017-12-01 Thread Andres Freund
Hi,

On 2017-11-27 22:25:12 +1300, Thomas Munro wrote:
Pushed 0003-Add-infrastructure-for-sharing-temporary-files-betwe.patch
after minor adjustments (some including conversing with you).

Changes:
- Changed an impossible elog() into an Assert().
- changed SharedFileSet->counter, and the backend static variable, to
  uint32. Not impossible that a 32bit int overflows over the course of a
  few weeks, and we a) imo shouldn't unnecessarily rely on signed
  overflow being defined b) a negative number would look weird, even if
  well defined (-fwrapv et al).
- Added a small comment about arbitrary-ness of the 8 in Oid
  tablespaces[8].
- pgindent'ed

Questions:
- Right now RemovePgTempFilesInDir() will recurse into appropriately
  named directories, and when it recurses it doesn't require the same
  name pattern checks. I think that's good, but I think it'd be prudent
  to be a bit more paranoid and prevent recursing into symlinked
  subdirectories.
- As we don't clean temp files after crash-restarts it isn't impossible
  to have a bunch of crash-restarts and end up with pids *and* per-pid
  shared file set counters reused. Which'd lead to conflicts. Do we care?

Greetings,

Andres Freund



Re: [HACKERS] Parallel Hash take II

2017-11-29 Thread Andres Freund
On 2017-11-30 14:17:51 +1300, Thomas Munro wrote:
> On Mon, Nov 27, 2017 at 10:25 PM, Thomas Munro
>  wrote:
> > On Thu, Nov 23, 2017 at 12:36 AM, Thomas Munro
> >  wrote:
> >> Here's a new patch set with responses to the last batch of review comments.
> >
> > Rebased on top of the recent SGML->XML change.
>
> Andres asked me off-list how I tested the barrier.c case where a
> backend detaches, releasing other waiters.  There are special cases in
> BarrierArriveAndWait() and BarrierDetach() for that to make sure that
> the phase advances and waiters are released if they were only waiting
> for this one backend to arrive, and that exactly one of them is
> "elected" for any serial work.  Normally the last to arrive is elected
> (see earlier discussion about why that's a good idea), but if the one
> that would be last detaches instead of arriving then we'll have to
> elect someone else.  Here is a throw-away test harness that can be
> used to exercise that case.  CREATE EXTENSION test_barrier; SELECT
> test_barrier_detach_releases(1);.  Adding a sleep before BarrierDetach
> can be used to influence the race, and adding elog messages to
> barrier.c can be used to see when the special path is taken.

Played some with this, and confirmed that that case
works. Pushed. Interestingly my first few experiments with delaying the
master using pg_usleep() were *not* successful because the leader gets
signalled by the workers successfully having started, interrupting the
sleep...

I've not made any meaningful changes, there was a * missing in an empty
comment line, and I think I've added a single comma.

I do wonder if it's maybe worth adding a single graf about error
handling to the intro? It may not be obvious that this requires
installing a hook calling detach.

- Andres



Re: [HACKERS] Parallel Hash take II

2017-11-29 Thread Thomas Munro
On Mon, Nov 27, 2017 at 10:25 PM, Thomas Munro
 wrote:
> On Thu, Nov 23, 2017 at 12:36 AM, Thomas Munro
>  wrote:
>> Here's a new patch set with responses to the last batch of review comments.
>
> Rebased on top of the recent SGML->XML change.

Andres asked me off-list how I tested the barrier.c case where a
backend detaches, releasing other waiters.  There are special cases in
BarrierArriveAndWait() and BarrierDetach() for that to make sure that
the phase advances and waiters are released if they were only waiting
for this one backend to arrive, and that exactly one of them is
"elected" for any serial work.  Normally the last to arrive is elected
(see earlier discussion about why that's a good idea), but if the one
that would be last detaches instead of arriving then we'll have to
elect someone else.  Here is a throw-away test harness that can be
used to exercise that case.  CREATE EXTENSION test_barrier; SELECT
test_barrier_detach_releases(1);.  Adding a sleep before BarrierDetach
can be used to influence the race, and adding elog messages to
barrier.c can be used to see when the special path is taken.

-- 
Thomas Munro
http://www.enterprisedb.com
From 4ba3fe2e318dd1a7b54b3b9ae8853531ed3a6e12 Mon Sep 17 00:00:00 2001
From: Thomas Munro 
Date: Thu, 30 Nov 2017 14:00:10 +1300
Subject: [PATCH] A simple test module for barrier.c.

Some things to try:

CREATE EXTENSION test_barrier;
SELECT test_barrier_detach_releases(1);
SELECT test_barrier_reattach_random(4, 1000);
SELECT test_barrier_alternate(4, 1000);
---
 src/test/modules/test_barrier/Makefile |  18 ++
 .../modules/test_barrier/test_barrier--1.0.sql |  18 ++
 src/test/modules/test_barrier/test_barrier.c   | 270 +
 src/test/modules/test_barrier/test_barrier.control |   4 +
 4 files changed, 310 insertions(+)
 create mode 100644 src/test/modules/test_barrier/Makefile
 create mode 100644 src/test/modules/test_barrier/test_barrier--1.0.sql
 create mode 100644 src/test/modules/test_barrier/test_barrier.c
 create mode 100644 src/test/modules/test_barrier/test_barrier.control

diff --git a/src/test/modules/test_barrier/Makefile b/src/test/modules/test_barrier/Makefile
new file mode 100644
index 000..9f330b6a45d
--- /dev/null
+++ b/src/test/modules/test_barrier/Makefile
@@ -0,0 +1,18 @@
+# src/test/modules/test_barrier/Makefile
+
+MODULES = test_barrier
+
+EXTENSION = test_barrier
+DATA = test_barrier--1.0.sql
+PGFILEDESC = "test_barrier -- some tests for barrier.c"
+
+ifdef USE_PGXS
+PG_CONFIG = pg_config
+PGXS := $(shell $(PG_CONFIG) --pgxs)
+include $(PGXS)
+else
+subdir = src/test/modules/test_barrier
+top_builddir = ../../../..
+include $(top_builddir)/src/Makefile.global
+include $(top_srcdir)/contrib/contrib-global.mk
+endif
diff --git a/src/test/modules/test_barrier/test_barrier--1.0.sql b/src/test/modules/test_barrier/test_barrier--1.0.sql
new file mode 100644
index 000..7077f3f6567
--- /dev/null
+++ b/src/test/modules/test_barrier/test_barrier--1.0.sql
@@ -0,0 +1,18 @@
+\echo Use "CREATE EXTENSION test_barrier" to load this file. \quit
+
+CREATE FUNCTION test_barrier_alternate(workers int, loops int)
+RETURNS void
+AS 'MODULE_PATHNAME'
+LANGUAGE C;
+
+CREATE FUNCTION test_barrier_reattach_random(workers int, end_phase int)
+RETURNS void
+AS 'MODULE_PATHNAME'
+LANGUAGE C;
+
+CREATE FUNCTION test_barrier_detach_releases(workers int)
+RETURNS void
+AS 'MODULE_PATHNAME'
+LANGUAGE C;
+
+
diff --git a/src/test/modules/test_barrier/test_barrier.c b/src/test/modules/test_barrier/test_barrier.c
new file mode 100644
index 000..98adefa809d
--- /dev/null
+++ b/src/test/modules/test_barrier/test_barrier.c
@@ -0,0 +1,270 @@
+#include "postgres.h"
+
+#include "fmgr.h"
+#include "funcapi.h"
+#include "miscadmin.h"
+#include "pgstat.h"
+#include "postmaster/bgworker.h"
+#include "storage/barrier.h"
+#include "storage/dsm.h"
+#include "storage/proc.h"
+#include "utils/builtins.h"
+#include "utils/resowner.h"
+
+#include 
+#include 
+
+PG_MODULE_MAGIC;
+
+PG_FUNCTION_INFO_V1(test_barrier_alternate);
+PG_FUNCTION_INFO_V1(test_barrier_reattach_random);
+PG_FUNCTION_INFO_V1(test_barrier_detach_releases);
+
+extern Datum test_barrier_main(Datum);
+
+typedef enum test_mode
+{
+   TEST_MODE_ALTERNATE,
+   TEST_MODE_REATTACH_RANDOM,
+   TEST_MODE_DETACH_RELEASES
+}  test_mode;
+
+typedef struct test_barrier_alternate_state
+{
+   Barrier barrier1;
+   Barrier barrier2;
+   int loops;
+}  test_barrier_alternate_state;
+
+typedef struct test_barrier_reattach_random_state
+{
+   Barrier barrier;
+   int end_phase;
+}  test_barrier_reattach_random_state;
+
+typedef struct test_barrier_detach_releases_state
+{
+   Barrier barrier;
+}  test_barrier_detach_releases_state;
+
+typedef struct test_barrier_state
+{
+   test_mode 

Re: [HACKERS] Parallel Hash take II

2017-11-27 Thread Thomas Munro
On Thu, Nov 23, 2017 at 12:36 AM, Thomas Munro
 wrote:
> Here's a new patch set with responses to the last batch of review comments.

Rebased on top of the recent SGML->XML change. Also:

1.  The final patch in the v26 patchset extended EXPLAIN ANALYZE
output to show per-worker information.  I'm withdrawing that patch for
now.  If you want to see how many tuples each backend hashed you can
already do that with (ANALYZE, VERBOSE).  It's a pre-existing bug that
you don't get batch/bucket/size info when Hash Join doesn't run in the
leader, and it's a pre-existing bug that EXPLAIN doesn't show
information for the leader separately.  I decided that it's not this
patchset's job to fix that stuff, and it's not entirely clear what the
best approach is anyway.  Let's discuss the way that information is
captured and displayed separately from the Parallel Hash feature.

2.  I found a way to crash v26 by starting a worker very late.  Fixed.

Unfortunately I saw a one-off case of an assertion failure in
ExecParallelHashRepartitionRest()/sts_begin_parallel_scan() on Travis
CI that I can't explain.  I haven't been able to reproduce it there or
on any other machine since.  I am still looking into it.

-- 
Thomas Munro
http://www.enterprisedb.com


parallel-hash-v27.patchset.tgz
Description: GNU Zip compressed data


Re: [HACKERS] Parallel Hash take II

2017-11-22 Thread Thomas Munro
Hi Andres and Peter,

Here's a new patch set with responses to the last batch of review comments.

On Wed, Nov 15, 2017 at 10:24 AM, Andres Freund  wrote:
> Hm. The way you access this doesn't quite seem right:
> ...
> +  matches := regexp_matches(line, '  Batches: ([0-9]+)');
> ...
>
> Why not use format json and access the output that way? Then you can be
> sure you access the right part of the tree and such?

Okay, done.

>> 1.  It determines the amount of unfairness when we run out of data:
>> it's the maximum amount of extra data that the unlucky last worker can
>> finish up with after all the others have finished.  I think this
>> effect is reduced by higher level factors: when a reader runs out of
>> data in one backend's file, it'll start reading another backend's
>> file.  If it's hit the end of all backends' files and this is an outer
>> batch, Parallel Hash will just go and work on another batch
>> immediately.
>
> Consider e.g. what happens if there's the occasional 500MB datum, and
> the rest's very small...
>
>> Better ideas?
>
> Not really. I'm more than a bit suspicious of this solution, but I don't
> really have a great suggestion otherwise.  One way to combat extreme
> size skew would be to put very large datums into different files.

Right.  I considered opening a separate file for each chunk size (4
page, 8 page, 16 page, ...).  Then each file has uniform chunk size,
so you're not stuck with a large chunk size after one monster tuple
comes down the pipe.  I didn't propose that because it leads to even
more file descriptors being used.  I'd like to move towards fewer file
descriptors, because it's a cap on the number of partitions you can
reasonably use.  Perhaps in future we could develop some kind of
general purpose file space manager that would let us allocate extents
within a file, and then SharedTuplestore could allocate extents for
each chunk size.  Not today though.

> But I think we probably can go with your approach for now, ignoring my
> failure prone spidey senses ;)

Cool.

>> > This looks more like the job of an lwlock rather than a spinlock.
>>
>> ... assembler ...
>>
>> That should be OK, right?
>
> It's not too bad. Personally I'm of the opinion though that pretty much
> no new spinlocks should be added - their worst case performance
> characteristics are bad enough for that to be only worth the
> experimentation in cases where each cycle really matters and where
> contention is unlikely.

Changed to LWLock.

>> > One day we're going to need a better approach to this. I have no idea
>> > how, but this per-node, and now per_node * max_parallelism, approach has
>> > only implementation simplicity as its benefit.
>>
>> I agree, and I am interested in that subject.  In the meantime, I
>> think it'd be pretty unfair if parallel-oblivious hash join and
>> sort-merge join and every other parallel plan get to use work_mem * p
>> (and in some cases waste it with duplicate data), but Parallel Hash
>> isn't allowed to do the same (and put it to good use).
>
> I'm not sure I care about fairness between pieces of code ;)

I'll leave that discussion to run in a separate subthread...

>> BTW this is not per-tuple code -- it runs once at the end of hashing.
>> Not sure what you're looking for here.
>
> It was more a general statement about all the branches in nodeHashjoin,
> than about these specific branches. Should've made that clearer. There's
> definitely branches in very common parts:
>
> ...
>
> I don't think you should do so now, but I think a reasonable approach
> here would be to move the HJ_BUILD_HASHTABLE code into a separate
> function (it really can't be hot). Then have specialized ExecHashJoin()
> versions for parallel/non-parallel and potentially for outer/inner/anti.

Okay, here's my proposal for how to get new branches out of the
per-tuple path.  I have separated ExecHashJoin() and
ExecParallelHashJoin() functions.  They call an inline function
ExecHashJoinImpl() with a constant parameter, so that I effectively
instantiate two variants with the unwanted branches removed by
constant folding.  Then the appropriate variant is installed as the
ExecProcNode function pointer.

Just "inline" wasn't enough though.  I defined
pg_attribute_always_inline to force that on GCC/Clang et al and MSVC.

I also created a separate function ExecParallelScanHashBucket().  I
guess I could have extended the above trick into ExecScanHashBucket()
and further too, but it's called from a different translation unit,
and I also don't want to get too carried away with this technique.  I
chose to supply different functions.

So -- that's introducing a couple of new techniques into the tree.
Treating ExecProcNode as a configuration point for a single node type,
and the function instantiation trick.  Thoughts?
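
For reference, the shape of the instantiation trick (bodies elided):

    static pg_attribute_always_inline TupleTableSlot *
    ExecHashJoinImpl(PlanState *pstate, bool parallel)
    {
        /*
         * Shared implementation; "parallel" is a compile-time constant at
         * both call sites below, so the unwanted branches fold away.
         */
        if (parallel)
        {
            /* parallel-aware paths */
        }
        else
        {
            /* parallel-oblivious paths */
        }
        return NULL;            /* real body elided */
    }

    static TupleTableSlot *
    ExecHashJoin(PlanState *pstate)
    {
        return ExecHashJoinImpl(pstate, false);
    }

    static TupleTableSlot *
    ExecParallelHashJoin(PlanState *pstate)
    {
        return ExecHashJoinImpl(pstate, true);
    }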

>> > If we don't split this into two versions, we at least should store
>> > hashNode->parallel_state in a local var, so the compiler doesn't have to
>> > pull that out of memory after every e

Re: [HACKERS] Parallel Hash take II

2017-11-17 Thread Andres Freund
Hi,

On 2017-11-14 01:30:30 +1300, Thomas Munro wrote:
> New patch attached.

0002-Add-a-barrier-primitive-for-synchronizing-backends.patch
- Intro starts with:
+ *
+ * This implementation of barriers allows for static sets of participants
+ * known up front, or dynamic sets of participants which processes can join or
  that's a bit low-level, no? Should explain what they're for, not
  everyone's going to be familiar with the concept.

- The comment for BarrierInit() reads a bit weird:
+ * Initialize this barrier, setting a static number of participants that we
+ * will wait for at each computation phase.  To use a dynamic number of
+ * participants, this number should be zero, and BarrierAttach and
  Set a static number! Except maybe not?

- I'm not entirely convinced that we want the barrier debug stuff
  merged, what are your feelings about that?  It's like half the code,
  and adds some complexity to the non debug code... If we indeed want to
  keep it, it probably should be documented in config.sgml? And get an
  entry in pg_config_manual.h?

- Can we add assertions ensuring nobody attaches to a static barrier?

- If I understand correctly currently the first participant to arrive at
  a barrier is going to be selected, and the last wakes everyone
  up. Wouldn't it be better to have the last arrived participant be
  selected? It already got a scheduler timeslice, its memory is most
  likely to be in cache etc? Given that in cases where selection plays a
  role it's going to be blocking everyone else, using the process most
  likely to finish first seems like a good idea.  I guess the
  BarrierDetach() implementation would be slightly more complex?


Regards,

Andres



Re: [HACKERS] Parallel Hash take II

2017-11-17 Thread Peter Geoghegan
On Fri, Nov 17, 2017 at 1:55 PM, Andres Freund  wrote:
> Looking at 0005-Add-infrastructure-for-sharing-temporary-files-betwe.patch:
> - The created path/filenames seem really redundant:
>   base/pgsql_tmp/pgsql_tmp11160.9.sharedfileset.d/pgsql_tmp.o3of8.p0.0
>
>   Including pgsql_tmp no less than three times seems a bit absurd.
>
>   I'm quite inclined to just remove all but the first.

+1

> - It's not clear to me why it's correct to have the vfdP->fdstate & 
> FD_TEMPORARY
>   handling in FileClose() be independent of the file being deleted. At
>   the very least there needs to be a comment explaining why we chose
>   that behaviour.

Isn't that just because only one backend is supposed to delete the
file, but they all must close it and do temp_file_limit accounting?
Sorry if I missed something (my explanation seems too obvious to be
correct).

> - I think we need to document somewhere that the temp_file_limit in a
>   shared file set applies independently for each participant that's
>   writing something.  We also should discuss whether that's actually
>   sane behaviour.

This is already the documented behavior of temp_file_limit, fwiw.

Another question for Thomas: Is it okay that routines like
BufFileOpenShared() introduce new palloc()s (not repalloc()s) to
buffile.c, given that struct BufFile already contains this?:

/*
 * resowner is the ResourceOwner to use for underlying temp files.  (We
 * don't need to remember the memory context we're using explicitly,
 * because after creation we only repalloc our arrays larger.)
 */
ResourceOwner resowner;

Maybe we need to remember the original caller's memory context, too?
Either that, or the contract/comments for memory contexts need to be
revised.

-- 
Peter Geoghegan



Re: [HACKERS] Parallel Hash take II

2017-11-17 Thread Andres Freund
Hi,

On 2017-11-14 01:30:30 +1300, Thomas Munro wrote:
> New patch attached.

(I've commit some of the preliminary work)

Looking at 0005-Add-infrastructure-for-sharing-temporary-files-betwe.patch:
- The created path/filenames seem really redundant:
  base/pgsql_tmp/pgsql_tmp11160.9.sharedfileset.d/pgsql_tmp.o3of8.p0.0

  Including pgsql_tmp no less than three times seems a bit absurd.

  I'm quite inclined to just remove all but the first.

- There seems to be a moment where we could leak temporary file
  directories:
File
SharedFileSetCreate(SharedFileSet *fileset, const char *name)
{
char        path[MAXPGPATH];
File        file;

SharedFilePath(path, fileset, name);
file = PathNameCreateTemporaryFile(path, false);

/* If we failed, see if we need to create the directory on demand. */
if (file <= 0)
{
char        tempdirpath[MAXPGPATH];
char        filesetpath[MAXPGPATH];
Oid         tablespace = ChooseTablespace(fileset, name);

TempTablespacePath(tempdirpath, tablespace);
SharedFileSetPath(filesetpath, fileset, tablespace);
PathNameCreateTemporaryDir(tempdirpath, filesetpath);
file = PathNameCreateTemporaryFile(path, true);
}

return file;
}

  The resowner handling is done in PathNameCreateTemporaryFile(). But if
  we fail after creating the directory we'll not have a resowner for
  that. That's probably not too bad.

- related to the last point, I'm kinda wondering why we need sub-fileset
  resowners? Given we're dealing with error paths in resowners I'm not
  quite seeing the point - we're not going to want to roll back
  sub-parts of a fileset, no?

- If we want to keep these resowners, shouldn't we unregister them in
  PathNameDeleteTemporaryFile?

- PathNameCreateTemporaryFile() and OpenTemporaryFile() now overlap
  quite a bit. Can't we rejigger things to base the second on the first?
  At the very least the comments need to work out the difference more
  closely.

- It's not clear to me why it's correct to have the vfdP->fdstate & FD_TEMPORARY
  handling in FileClose() be independent of the file being deleted. At
  the very least there needs to be a comment explaining why we chose
  that behaviour.

- I think we need to document somewhere that the temp_file_limit in a
  shared file set applies independently for each participant that's
  writing something.  We also should discuss whether that's actually
  sane behaviour.

Greetings,

Andres Freund



Re: [HACKERS] Parallel Hash take II

2017-11-17 Thread Robert Haas
On Thu, Nov 16, 2017 at 6:42 PM, Andres Freund  wrote:
> What I'm basically worried about is that the *only* reason for some
> plans to choose to use parallelism is that, essentially, the effective
> amount of work_mem differs between the plans: the parallel one uses
> (max_parallel_workers_per_gather + 1) * work_mem. Which might push
> queries to use parallelism even if it's not actually beneficial in
> reducing runtime.

I don't get it.  If we switch to using parallelism and the runtime
doesn't go down, that just means the costing is wrong.  The costing is
supposed to reflect the runtime.

It's true that adding parallel hash may tend to increase the amount of
memory that gets used in practice.  But it seems to me that any plan
that uses memory and is sometimes chosen over a non-memory-using plan
does that.

> Thomas' earlier comparison of this behaviour with e.g. parallel
> oblivious hash nodes does *NOT* seem apt to me. There's currently
> effectively no cost pressure for/against parallelism for those (even if
> there potentially should). Which means they do not favor parallel
> queries solely because they're allowed to use more memory, and thus it's
> far less likely that every one of those nodes uses the maximum allotted
> work_mem.

I agree that there is no cost pressure for or against parallelism.
The design which I have been pursuing with parallel query up to this
point is that cost represents execution time, so minimizing cost means
minimizing execution time, and that's the goal.  If we want, we can
put a tax on parallel query plans, so that they're only chosen when
they are *substantially* cheaper than non-parallel plans, or even that
a plan with a few workers is to be preferred over a plan with more
workers unless the gains are sufficiently substantial.  But I don't
think the right way to do that is to prevent a parallel hash from
using as much memory as a non-parallel hash in the same spot would
use.

Rather, what we could do is, for example, have a knob that multiplies
the cost of a partial path by a floating-point value when we insert a
Gather/Gather Merge node.  If you want to always pick the cheapest
path regardless of whether it uses parallelism, set the GUC to 1.0.
If you only want to pick a parallel query path if it's twice as cheap
as the best non-parallel path, set the GUC to 2.0.

> I think it's wrong to just multiply the amount of work_mem that way, and
> it'll bite us. Introducing a separate guc, perhaps inheriting from
> work_mem if set to -1, that limits the amount of memory inside a
> parallel node seems saner. That value then would not be multiplied with
> the chosen worker number.

I don't mind having an option to override the amount of memory that
> parallel hash is allowed to use, but I'm also not yet convinced that
we have a real problem that needs solving.

> As far as I understand it we currently cost a gather's startup cost
> once, not per worker.

Yes - because cost is supposed to measure execution time, and the
workers start all at once, not sequentially.  The startup time for the
workers as a whole doesn't get much longer as the number of workers
rises.  It might be that the formula should be 1000 + 50/worker or
something rather than a constant, but I doubt it matters very much.
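
(Reading that as 50 per worker, four workers would give 1000 + 4 * 50 =
1200 rather than a flat 1000 -- probably not a big enough difference to
change many plan choices.)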

> That startup cost effectively includes redundant
> work like materializing a table, building parallel oblivious hashtables,
> resorting the same table for a mergejoin etc...

Right, because that work all contributes to total execution time.
Actually, when the same work is going to be redone in multiple
workers, we should really inflate the cost, because if two workers
each sort the same table, it takes each of them longer than it would
take a single worker to sort the table for itself.  That's one of the
biggest problems with parallel costing right now, from what I've seen:
we're too willing to do the same work over in each of 4-6
participants, not realizing that they're going to contend for buffer
locks and I/O bandwidth.

> That's fine if your goal is solely to return a single query as fast as
> possible, even if doubling the resources will only give you a minimal
> cost advantage.  If instead your goal is to optimize both for individual
> query performance and overall system throughput that's obviously not
> good.

I agree.

> I think we should cost the startup cost of parallel nodes more like
> startup_cost * (max_parallel_workers_per_gather * 
> parallel_resource_efficiency + 1)
>
> which'd allow to tune query performance for using parallelism even for
> tiny benefits (parallel_resource_efficiency = 0), and conversely tune it
> so it only gets used if the overall loss of efficiency is small
> (parallel_resource_efficiency = 0.9 or such).

No, I think that's wrong.  I think you need to apply the adjustment at
the end of the (parallel) planning process, not incrementally to each
node.  Otherwise, the costs assigned to the individual nodes become
some unholy amalgam of execution time and total r

Re: [HACKERS] Parallel Hash take II

2017-11-16 Thread Andres Freund
Hi,

On 2017-11-15 14:09:13 -0500, Robert Haas wrote:
> On Wed, Nov 15, 2017 at 1:35 PM, Andres Freund  wrote:
> > But this does bug me, and I think it's what made me pause here to make a
> > bad joke.  The way that parallelism treats work_mem makes it even more
> > useless of a config knob than it was before.  Parallelism, especially
> > after this patch, shouldn't compete / be benchmarked against a
> > single-process run with the same work_mem. To make it "fair" you need to
> > compare parallelism against a single threaded run with work_mem *
> > max_parallelism.
> 
> I don't really know how to do a fair comparison between a parallel
> plan and a non-parallel plan.  Even if the parallel plan contains zero
> nodes that use work_mem, it might still use more memory than the
> non-parallel plan, because a new backend uses a bunch of memory.  If
> you really want a comparison that is fair on the basis of memory
> usage, you have to take that into account somehow.

That's not quite what I'm concerned about.  Consider something
(completely artificial) like:

tpch_5[18786][1]=# SET work_mem = '50MB';
tpch_5[18786][1]=# EXPLAIN SELECT c_name, count(*) FROM orders JOIN customer ON 
(o_custkey = c_custkey) WHERE o_orderdate BETWEEN '1995-01-01' AND '1995-01-05' 
GROUP BY 1 ORDER BY count(*) DESC LIMIT 10;
                                                      QUERY PLAN
------------------------------------------------------------------------------------------------------------------------
 Limit  (cost=77344.16..77344.19 rows=10 width=27)
   ->  Sort  (cost=77344.16..77379.68 rows=14206 width=27)
         Sort Key: (count(*)) DESC
         ->  HashAggregate  (cost=76895.12..77037.18 rows=14206 width=27)
               Group Key: customer.c_name
               ->  Hash Join  (cost=35347.04..76824.09 rows=14206 width=19)
                     Hash Cond: (orders.o_custkey = customer.c_custkey)
                     ->  Bitmap Heap Scan on orders  (cost=302.04..41599.74 rows=14206 width=4)
                           Recheck Cond: ((o_orderdate >= '1995-01-01'::date) AND (o_orderdate <= '1995-01-05'::date))
                           ->  Bitmap Index Scan on i_o_orderdate  (cost=0.00..298.49 rows=14206 width=0)
                                 Index Cond: ((o_orderdate >= '1995-01-01'::date) AND (o_orderdate <= '1995-01-05'::date))
                     ->  Hash  (cost=25670.00..25670.00 rows=75 width=23)
                           ->  Seq Scan on customer  (cost=0.00..25670.00 rows=75 width=23)
(13 rows)

tpch_5[18786][1]=# SET work_mem = '10MB';
tpch_5[18786][1]=# EXPLAIN SELECT c_name, count(*) FROM orders JOIN customer ON 
(o_custkey = c_custkey) WHERE o_orderdate BETWEEN '1995-01-01' AND '1995-01-05' 
GROUP BY 1 ORDER BY count(*) DESC LIMIT 10;
                                                      QUERY PLAN
------------------------------------------------------------------------------------------------------------------------
 Limit  (cost=82847.92..82847.94 rows=10 width=27)
   ->  Sort  (cost=82847.92..82883.43 rows=14206 width=27)
         Sort Key: (count(*)) DESC
         ->  HashAggregate  (cost=82398.87..82540.93 rows=14206 width=27)
               Group Key: customer.c_name
               ->  Merge Join  (cost=42580.44..82327.84 rows=14206 width=19)
                     Merge Cond: (customer.c_custkey = orders.o_custkey

Re: [HACKERS] Parallel Hash take II

2017-11-15 Thread Thomas Munro
On Thu, Nov 16, 2017 at 8:09 AM, Robert Haas  wrote:
> On Wed, Nov 15, 2017 at 1:35 PM, Andres Freund  wrote:
>> But this does bug me, and I think it's what made me pause here to make a
>> bad joke.  The way that parallelism treats work_mem makes it even more
>> useless of a config knob than it was before.  Parallelism, especially
>> after this patch, shouldn't compete / be benchmarked against a
>> single-process run with the same work_mem. To make it "fair" you need to
>> compare parallelism against a single threaded run with work_mem *
>> max_parallelism.
>
> I don't really know how to do a fair comparison between a parallel
> plan and a non-parallel plan.  Even if the parallel plan contains zero
> nodes that use work_mem, it might still use more memory than the
> non-parallel plan, because a new backend uses a bunch of memory.  If
> you really want a comparison that is fair on the basis of memory
> usage, you have to take that into account somehow.
>
> But even then, the parallel plan is also almost certainly consuming
> more CPU cycles to produce the same results.  Parallelism is all about
> trading away efficiency for execution time.  Not just because of
> current planner and executor limitations, but intrinsically, parallel
> plans are less efficient.  The globally optimal solution on a system
> that is short on either memory or CPU cycles is to turn parallelism
> off.

The guys who worked on the first attempt at Parallel Query for
Berkeley POSTGRES (and then ripped that out, moving to another project
called XPRS which I have found no trace of, perhaps it finished up in
some commercial RDBMS) wrote this[1]:

"The objective function that XPRS uses for query optimization is a
combination of resource consumption and response time as follows:

  cost = resource consumption + w * response time

Here w is a system-specific weighting factor. A small w mostly
optimizes resource consumption, while a large w mostly optimizes
response time. Resource consumption is measured by the number of disk
pages accessed and number of tuples processed, while response time is
the elapsed time for executing the query."

[1] http://db.cs.berkeley.edu/papers/ERL-M93-28.pdf
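
To make the trade-off concrete with made-up numbers: if a serial plan
consumes 100 units of resources in 100 units of time, and a four-worker
plan consumes 160 units but finishes in 40, then with w = 0.1 the costs
come out as 100 + 10 = 110 versus 160 + 4 = 164 and the serial plan wins,
while with w = 2 they are 100 + 200 = 300 versus 160 + 80 = 240 and the
parallel plan wins.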

-- 
Thomas Munro
http://www.enterprisedb.com



Re: [HACKERS] Parallel Hash take II

2017-11-15 Thread Thomas Munro
On Thu, Nov 16, 2017 at 7:57 AM, Peter Geoghegan  wrote:
> The contrast with the situation with Thomas and his hash join patch is
> interesting. Hash join is *much* more sensitive to the availability of
> memory than a sort operation is.
>
>> I don't really have a good answer to "but what should we otherwise do",
>> but I'm doubtful this is quite the right answer.
>
> I think that the work_mem model should be replaced by something that
> centrally budgets memory. It would make sense to be less generous with
> sorts and more generous with hash joins when memory is in short
> supply, for example, and a model like this can make that possible. The
> work_mem model has always forced users to be far too conservative.
> Workloads are very complicated, and always having users target the
> worst case leaves a lot to be desired.

In the old days, Oracle had only simple per-operation memory limits
too, and that applied to every operation in every thread just like our
work_mem.  It's interesting that they had separate knobs for sort and
hash though, and defaulted to giving hash twice as much.

With a whole-plan memory target, our planner would probably begin to
plan join order differently to minimise the number of hash tables in
memory at once, like other RDBMSs.  Not sure how the plan-wide target
should work though -- try many plans, giving different portions of
budget to different subplans?  That should work fine if you like
O(please-melt-my-computer), especially if combined with a similar
approach to choosing worker numbers.  Some kind of feedback system?
Seems like a different kind of planner, but I have no clue.  If you
have ideas/papers/references, it'd be great to see a new thread on
that subject.

-- 
Thomas Munro
http://www.enterprisedb.com



Re: [HACKERS] Parallel Hash take II

2017-11-15 Thread Thomas Munro
On Thu, Nov 16, 2017 at 7:35 AM, Andres Freund  wrote:
> On 2017-11-15 08:37:11 -0500, Robert Haas wrote:
>> I mean, the very first version of this patch that Thomas submitted was
>> benchmarked by Rafia and had phenomenally good performance
>> characteristics.  That turned out to be because it wasn't respecting
>> work_mem; you can often do a lot better with more memory, and
>> generally you can't do nearly as well with less.  To make comparisons
>> meaningful, they have to be comparisons between algorithms that use
>> the same amount of memory.  And it's not just about testing.  If we
>> add an algorithm that will run twice as fast with equal memory but
>> only allow it half as much, it will probably never get picked and the
>> whole patch is a waste of time.
>
> But this does bug me, and I think it's what made me pause here to make a
> bad joke.  The way that parallelism treats work_mem makes it even more
> useless of a config knob than it was before.  Parallelism, especially
> after this patch, shouldn't compete / be benchmarked against a
> single-process run with the same work_mem. To make it "fair" you need to
> compare parallelism against a single threaded run with work_mem *
> max_parallelism.
>
> Thomas argues that this makes hash joins be treated fairly vis-à-vis
> parallel-oblivious hash join etc. And I think he has somewhat of a
> point. But I don't think it's quite right either: In several of these
> cases the planner will not prefer the multi-process plan because it uses
> more work_mem, it's a cost to be paid. Whereas this'll optimize towards
> using work_mem * max_parallel_workers_per_gather amount of memory.
>
> This makes it pretty much impossible to afterwards tune work_mem on a
> server in a reasonable manner. Previously you'd tune it to something
> like free_server_memory - (max_connections * work_mem *
> 80%_most_complex_query). Which you can't really do anymore now, you'd
> also need to multiply by max_parallel_workers_per_gather. Which means
> that you might end up "forcing" parallelism on a bunch of plans that'd
> normally execute in too short a time to make parallelism worth it.

Currently our way of choosing the number of workers is 'rule based':
we use a simple formula that takes relation sizes and some GUCs and
per-relation options as inputs.  The comparison against non-parallel
plans is cost based of course, but we won't consider any other number
of workers.

Suppose we had 'cost based' worker number selection instead:  simply
try planning with various different worker counts and pick the winner.
Then I think we'd see the moral hazard in this scheme more clearly:
the planner effectively has a free memory printing press.  It will
think that it's a good idea to use a huge number of workers to get
more and more work_mem-sized hash tables or in-memory sorts into
memory (whether that's with partition-wise join, Parallel Hash, or
something else).

We could switch to a model where work_mem is divided by the number of
workers.  Parallel Hash would be able to use a full work_mem by
combining them, and parallel-oblivious Hash would be able to use only
work_mem / participants.  That'd be the other way to give Parallel
Hash a fair amount of memory compared to the competition, but I didn't
propose that because it'd be a change to the already-released
behaviour.  Then I'd have been saying "hey, look at this new plan I
wrote, it performs really well if you first tie the other plans' shoe
laces together".  It may actually be better though, even without
Parallel Hash in the picture.
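
(With made-up numbers: at work_mem = 32MB and 4 participants, a
parallel-oblivious Hash would be limited to 8MB per copy, while Parallel
Hash could still combine the shares into a single 32MB table -- instead of
everyone getting the full 32MB as they do today.)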

-- 
Thomas Munro
http://www.enterprisedb.com



Re: [HACKERS] Parallel Hash take II

2017-11-15 Thread Robert Haas
On Wed, Nov 15, 2017 at 1:35 PM, Andres Freund  wrote:
> But this does bug me, and I think it's what made me pause here to make a
> bad joke.  The way that parallelism treats work_mem makes it even more
> useless of a config knob than it was before.  Parallelism, especially
> after this patch, shouldn't compete / be benchmarked against a
> single-process run with the same work_mem. To make it "fair" you need to
> compare parallelism against a single threaded run with work_mem *
> max_parallelism.

I don't really know how to do a fair comparison between a parallel
plan and a non-parallel plan.  Even if the parallel plan contains zero
nodes that use work_mem, it might still use more memory than the
non-parallel plan, because a new backend uses a bunch of memory.  If
you really want a comparison that is fair on the basis of memory
usage, you have to take that into account somehow.

But even then, the parallel plan is also almost certainly consuming
more CPU cycles to produce the same results.  Parallelism is all about
trading away efficiency for execution time.  Not just because of
current planner and executor limitations, but intrinsically, parallel
plans are less efficient.  The globally optimal solution on a system
that is short on either memory or CPU cycles is to turn parallelism
off.

> Thomas argues that this makes hash joins be treated fairly vis-à-vis
> parallel-oblivious hash join etc. And I think he has somewhat of a
> point. But I don't think it's quite right either: In several of these
> cases the planner will not prefer the multi-process plan because it uses
> more work_mem, it's a cost to be paid. Whereas this'll optimize towards
> using work_mem * max_parallel_workers_per_gather amount of memory.

In general, we have no framework for evaluating the global impact on
the system of our decisions.  Not just with parallelism, but in
general, plans that use memory are going to typically beat plans that
don't, because using more memory is a good way to make things run
faster, so the cost goes down, and the cost is what matters.
Everything optimizes for eating as many resources as possible, even if
there is only an extremely marginal gain over a more costly plan that
uses dramatically fewer resources.

A good example of this is a parallel bitmap heap scan vs. a parallel
sequential scan.  With enough workers, we switch from having one
worker scan the index and then having all workers do a joint scan of
the heap -- to just performing a parallel sequential scan.  Because of
Amdahl's law, as you add workers, waiting for the non-parallel index
scan to build the bitmap eventually looks less desirable than
accepting that you're going to uselessly scan a lot of pages -- you
stop caring, because you have enough raw power to just blast through
it.

The best non-parallel example I can think of off-hand is sorts.
Sometimes, reducing the amount of memory available for a sort really
doesn't cost very much in terms of execution time, but we always run a
sort with the full allotted work_mem, even if that's gigantic.  We
won't use it all if the data is smaller than work_mem, but if there's
batching going on then we will, even if it doesn't really help.

> This makes it pretty much impossible to afterwards tune work_mem on a
> server in a reasonable manner. Previously you'd tune it to something
> like free_server_memory - (max_connections * work_mem *
> 80%_most_complex_query). Which you can't really do anymore now, you'd
> also need to multiply by max_parallel_workers_per_gather. Which means
> that you might end up "forcing" parallelism on a bunch of plans that'd
> normally execute in too short a time to make parallelism worth it.

I think you just need to use max_connections +
Min(max_parallel_workers, max_worker_processes) instead of
max_connections.  You can't use parallel query for every query at the
same time with reasonable settings...
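
(For example, with max_connections = 100, max_parallel_workers = 8 and
max_worker_processes = 8, you'd plug 100 + 8 = 108 into that formula
rather than 100, since the parallel workers are capped globally rather
than per session.)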

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: [HACKERS] Parallel Hash take II

2017-11-15 Thread Andres Freund
Hi,

On 2017-11-15 10:57:35 -0800, Peter Geoghegan wrote:
> > I don't really have a good answer to "but what should we otherwise do",
> > but I'm doubtful this is quite the right answer.
> 
> I think that the work_mem model should be replaced by something that
> centrally budgets memory. It would make sense to be less generous with
> sorts and more generous with hash joins when memory is in short
> supply, for example, and a model like this can make that possible. The
> work_mem model has always forced users to be far too conservative.
> Workloads are very complicated, and always having users target the
> worst case leaves a lot to be desired.

Obviously that's a nice and worthwhile goal, but it seems more than a bit
out of reach for this patchset.

Greetings,

Andres Freund



Re: [HACKERS] Parallel Hash take II

2017-11-15 Thread Peter Geoghegan
On Wed, Nov 15, 2017 at 10:35 AM, Andres Freund  wrote:
>> I realize you're sort of joking here, but I think it's necessary to
>> care about fairness between pieces of code.
>
> Indeed I kinda was.

When I posted v1 of parallel CREATE INDEX, it followed the hash join
model of giving workMem (maintenance_work_mem) to every worker. Robert
suggested that my comparison with a serial case was therefore not
representative, since I was using much more memory. I actually changed
the patch to use a single maintenance_work_mem share for the entire
operation afterwards, which seemed to work better. And, it made very
little difference to performance for my original benchmark in the end,
so I was arguably wasting memory in v1.

>> I mean, the very first version of this patch that Thomas submitted was
>> benchmarked by Rafia and had phenomenally good performance
>> characteristics.  That turned out to be because it wasn't respecting
>> work_mem; you can often do a lot better with more memory, and
>> generally you can't do nearly as well with less.  To make comparisons
>> meaningful, they have to be comparisons between algorithms that use
>> the same amount of memory.  And it's not just about testing.  If we
>> add an algorithm that will run twice as fast with equal memory but
>> only allow it half as much, it will probably never get picked and the
>> whole patch is a waste of time.

The contrast with the situation with Thomas and his hash join patch is
interesting. Hash join is *much* more sensitive to the availability of
memory than a sort operation is.

> I don't really have a good answer to "but what should we otherwise do",
> but I'm doubtful this is quite the right answer.

I think that the work_mem model should be replaced by something that
centrally budgets memory. It would make sense to be less generous with
sorts and more generous with hash joins when memory is in short
supply, for example, and a model like this can make that possible. The
work_mem model has always forced users to be far too conservative.
Workloads are very complicated, and always having users target the
worst case leaves a lot to be desired.

-- 
Peter Geoghegan



Re: [HACKERS] Parallel Hash take II

2017-11-15 Thread Andres Freund
On 2017-11-15 08:37:11 -0500, Robert Haas wrote:
> On Tue, Nov 14, 2017 at 4:24 PM, Andres Freund  wrote:
> >> I agree, and I am interested in that subject.  In the meantime, I
> >> think it'd be pretty unfair if parallel-oblivious hash join and
> >> sort-merge join and every other parallel plan get to use work_mem * p
> >> (and in some cases waste it with duplicate data), but Parallel Hash
> >> isn't allowed to do the same (and put it to good use).
> >
> > I'm not sure I care about fairness between pieces of code ;)
> 
> I realize you're sort of joking here, but I think it's necessary to
> care about fairness between pieces of code.

Indeed I kinda was.


> I mean, the very first version of this patch that Thomas submitted was
> benchmarked by Rafia and had phenomenally good performance
> characteristics.  That turned out to be because it wasn't respecting
> work_mem; you can often do a lot better with more memory, and
> generally you can't do nearly as well with less.  To make comparisons
> meaningful, they have to be comparisons between algorithms that use
> the same amount of memory.  And it's not just about testing.  If we
> add an algorithm that will run twice as fast with equal memory but
> only allow it half as much, it will probably never get picked and the
> whole patch is a waste of time.

But this does bug me, and I think it's what made me pause here to make a
bad joke.  The way that parallelism treats work_mem makes it even more
useless of a config knob than it was before.  Parallelism, especially
after this patch, shouldn't compete / be benchmarked against a
single-process run with the same work_mem. To make it "fair" you need to
compare parallelism against a single threaded run with work_mem *
max_parallelism.

Thomas argues that this makes hash joins be treated fairly vis-à-vis
parallel-oblivious hash join etc. And I think he has somewhat of a
point. But I don't think it's quite right either: In several of these
cases the planner will not prefer the multi-process plan because it uses
more work_mem, it's a cost to be paid. Whereas this'll optimize towards
using work_mem * max_parallel_workers_per_gather amount of memory.

This makes it pretty much impossible to afterwards tune work_mem on a
server in a reasonable manner. Previously you'd tune it to something
like free_server_memory - (max_connections * work_mem *
80%_most_complex_query). Which you can't really do anymore now, you'd
also need to multiply by max_parallel_workers_per_gather. Which means
that you might end up "forcing" parallelism on a bunch of plans that'd
normally execute in too short a time to make parallelism worth it.


I don't really have a good answer to "but what should we otherwise do",
but I'm doubtful this is quite the right answer.


Greetings,

Andres Freund



Re: [HACKERS] Parallel Hash take II

2017-11-15 Thread Robert Haas
On Tue, Nov 14, 2017 at 4:24 PM, Andres Freund  wrote:
>> I agree, and I am interested in that subject.  In the meantime, I
>> think it'd be pretty unfair if parallel-oblivious hash join and
>> sort-merge join and every other parallel plan get to use work_mem * p
>> (and in some cases waste it with duplicate data), but Parallel Hash
>> isn't allowed to do the same (and put it to good use).
>
> I'm not sure I care about fairness between pieces of code ;)

I realize you're sort of joking here, but I think it's necessary to
care about fairness between pieces of code.

I mean, the very first version of this patch that Thomas submitted was
benchmarked by Rafia and had phenomenally good performance
characteristics.  That turned out to be because it wasn't respecting
work_mem; you can often do a lot better with more memory, and
generally you can't do nearly as well with less.  To make comparisons
meaningful, they have to be comparisons between algorithms that use
the same amount of memory.  And it's not just about testing.  If we
add an algorithm that will run twice as fast with equal memory but
only allow it half as much, it will probably never get picked and the
whole patch is a waste of time.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: [HACKERS] Parallel Hash take II

2017-11-14 Thread Andres Freund
Hi,

On 2017-11-14 01:30:30 +1300, Thomas Munro wrote:
> > +-- The "good" case: batches required, but we plan the right number; we
> > +-- plan for 16 batches, and we stick to that number, and peak memory
> > +-- usage stays within our work_mem budget
> > +-- non-parallel
> > +set max_parallel_workers_per_gather = 0;
> > +set work_mem = '128kB';
> >
> > So how do we know that's actually the case we're testing rather than
> > something arbitrarily different? There's IIRC tests somewhere that just
> > filter the json explain output to the right parts...
> 
> Yeah, good idea.  My earlier attempts to dump out the hash join
> dimensions ran into problems with architecture sensitivity and then
> some run-to-run non-determinism in the parallel case (due to varying
> fragmentation depending on how many workers get involved in time).
> The attached version tells you about batch growth without reporting
> the exact numbers, except in the "ugly" case where we know that there
> is only one possible outcome because the extreme skew detector is
> guaranteed to go off after the first nbatch increase (I got rid of all
> other tuples except ones with the same key to make this true).

Hm. The way you access this doesn't quite seem right:
+--
+-- exercises for the hash join code
+--
+begin;
+set min_parallel_table_scan_size = 0;
+set parallel_setup_cost = 0;
+-- Extract bucket and batch counts from an explain analyze plan.  In
+-- general we can't make assertions about how many batches (or
+-- buckets) will be required because it can vary, but we can in some
+-- special cases and we can check for growth.
+create or replace function hash_join_batches(query text)
+returns table (original int, final int) language plpgsql
+as
+$$
+declare
+  line text;
+  matches text[];
+begin
+  for line in
+    execute 'explain analyze ' || query
+  loop
+    matches := (regexp_matches(line, '  Batches: ([0-9]+) \(originally ([0-9]+)\)'));
+    if matches is not null then
+      original := matches[2]::int;
+      final := matches[1]::int;
+      return next;
+    else
+      matches := regexp_matches(line, '  Batches: ([0-9]+)');
+      if matches is not null then
+        original := matches[1]::int;
+        final := original;
+        return next;
+      end if;
+    end if;
+  end loop;
+end;
+$$;

Why not use format json and access the output that way? Then you can be
sure you access the right part of the tree and such?
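
Untested sketch of what I mean -- the function name is just for
illustration, and the "Hash Batches" / "Original Hash Batches" property
names are what I remember the structured explain formats emitting for
Hash nodes:

create or replace function hash_join_batches_json(query text)
returns table (original int, final int) language plpgsql
as
$$
declare
  plan_text text;
  whole_plan jsonb;
begin
  execute 'explain (analyze, format json) ' || query into plan_text;
  whole_plan := plan_text::jsonb;
  return query
    with recursive nodes(node) as (
      -- start at the root Plan node, then walk down through "Plans"
      select whole_plan -> 0 -> 'Plan'
      union all
      select child
      from nodes,
           lateral jsonb_array_elements(coalesce(node -> 'Plans', '[]'::jsonb)) as child
    )
    select (node ->> 'Original Hash Batches')::int,
           (node ->> 'Hash Batches')::int
    from nodes
    where node ? 'Hash Batches';
end;
$$;

That way the test only ever looks at Hash nodes, rather than
pattern-matching arbitrary lines of text output.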

> > +   else
> > +   {
> > +   errno = stat_errno;
> > +   elog(LOG, "could not stat file \"%s\": %m", path);
> > +   }
> >
> > All these messages are "not expected to ever happen" ones, right?
> 
> You'd have to suffer a nasty filesystem failure, a read-only remount, or
> someone manually messing with permissions or something.  Not sure where the line is,
> but I've changed all of these new elog calls to ereport.

Oh, I'd been fine keeping them as elogs. The one exception would have
been out-of-space cases which'll occur in practice.


> > +   if (vfdP->fdstate & FD_TEMP_FILE_LIMIT)
> > +   {
> > +   /* Subtract its size from current usage (do first in case 
> > of error) */
> > +   temporary_files_size -= vfdP->fileSize;
> > +   vfdP->fileSize = 0;
> > +   }
> >
> > So, is it right to do so unconditionally and without regard for errors?
> > If the file isn't deleted, it shouldn't be subtracted from fileSize. I
> > guess you're managing that through the flag, but that's not entirely
> > obvious.
> 
> I think it is.  Reasoning:  The existing behaviour of fd.c is that if
> we don't manage to delete temporary files, we'll LOG something and
> forget about them (they'll be cleaned up eventually by a clean restart
> or human intervention).

IOW: Never ;)


> > +/*
> > + * Write a tuple.  If a meta-data size was provided to sts_initialize, 
> > then a
> > + * pointer to meta data of that size must be provided.
> > + */
> > +void
> > +sts_puttuple(SharedTuplestoreAccessor *accessor, void *meta_data,
> > +MinimalTuple tuple)
> > +{
> >
> > +   /* Do we have space? */
> > +   size = accessor->sts->meta_data_size + tuple->t_len;
> > +   if (accessor->write_pointer + size >= accessor->write_end)
> > +   {
> > +   /* Try flushing to see if that creates enough space. */
> > +   if (accessor->write_chunk != NULL)
> > +   sts_flush_chunk(accessor);
> > +
> > +   /*
> > +* It may still not be enough in the case of a gigantic 
> > tuple, or if
> > +* we haven't created a chunk buffer at all yet.
> > +*/
> > +   if (accessor->write_pointer + size >= accessor->write_end)
> > +   {
> > +   SharedTuplestoreParticipant *participant;
> > +   size_t  space_needed;
> > +   int pages_needed;
> > +
> > +   /* How many pages to