Re: [HACKERS] WIP: [[Parallel] Shared] Hash

Andres Freund Sun, 26 Mar 2017 14:48:00 -0700

On 2017-03-23 20:35:09 +1300, Thomas Munro wrote:
> Here is a new patch series responding to feedback from Peter and Andres:


+
+/* Per-participant shared state. */
+typedef struct SharedTuplestoreParticipant
+{
+       LWLock lock;

Hm. No padding (ala LWLockMinimallyPadded / LWLockPadded) - but that's
probably ok, for now.




+       bool error;                                     /* Error occurred flag. 
*/
+       bool eof;                                       /* End of file reached. 
*/
+       int read_fileno;                        /* BufFile segment file number. 
*/
+       off_t read_offset;                      /* Offset within segment file. 
*/

Hm. I wonder if it'd not be better to work with 64bit offsets, and just
separate that out upon segment access.

+/* The main data structure in shared memory. */

"main data structure" isn't particularly meaningful.

+struct SharedTuplestore
+{
+       int reading_partition;
+       int nparticipants;
+       int flags;

Maybe add a comment saying /* flag bits from SHARED_TUPLESTORE_* */?

+       Size meta_data_size;

What's this?

+       SharedTuplestoreParticipant participants[FLEXIBLE_ARRAY_MEMBER];

I'd add a comment here, that there's further data after participants.

+};

+
+/* Per-participant backend-private state. */
+struct SharedTuplestoreAccessor
+{

Hm. The name and it being backend-local are a bit conflicting.

+       int participant;                        /* My partitipant number. */
+       SharedTuplestore *sts;          /* The shared state. */
+       int nfiles;                                     /* Size of local files 
array. */
+       BufFile **files;                        /* Files we have open locally 
for writing. */

Shouldn't this mention that it's indexed by partition?

+       BufFile *read_file;                     /* The current file to read 
from. */
+       int read_partition;                     /* The current partition to 
read from. */
+       int read_participant;           /* The current participant to read 
from. */
+       int read_fileno;                        /* BufFile segment file number. 
*/
+       off_t read_offset;                      /* Offset within segment file. 
*/
+};


+/*
+ * Initialize a SharedTuplestore in existing shared memory.  There must be
+ * space for sts_size(participants) bytes.  If flags is set to the value
+ * SHARED_TUPLESTORE_SINGLE_PASS then each partition may only be read once,
+ * because underlying files will be deleted.

Any reason not to use flags that are compatible with tuplestore.c?

+ * Tuples that are stored may optionally carry a piece of fixed sized
+ * meta-data which will be retrieved along with the tuple.  This is useful for
+ * the hash codes used for multi-batch hash joins, but could have other
+ * applications.
+ */
+SharedTuplestoreAccessor *
+sts_initialize(SharedTuplestore *sts, int participants,
+                          int my_participant_number,
+                          Size meta_data_size,
+                          int flags,
+                          dsm_segment *segment)
+{

Not sure I like that the naming here has little in common with
tuplestore.h's api.


+
+MinimalTuple
+sts_gettuple(SharedTuplestoreAccessor *accessor, void *meta_data)
+{

This needs docs.

+       SharedBufFileSet *fileset = GetSharedBufFileSet(accessor->sts);
+       MinimalTuple tuple = NULL;
+
+       for (;;)
+       {

...
+               /* Check if this participant's file has already been entirely 
read. */
+               if (participant->eof)
+               {
+                       BufFileClose(accessor->read_file);
+                       accessor->read_file = NULL;
+                       LWLockRelease(&participant->lock);
+                       continue;

Why are we closing the file while holding the lock?

+
+               /* Read the optional meta-data. */
+               eof = false;
+               if (accessor->sts->meta_data_size > 0)
+               {
+                       nread = BufFileRead(accessor->read_file, meta_data,
+                                                               
accessor->sts->meta_data_size);
+                       if (nread == 0)
+                               eof = true;
+                       else if (nread != accessor->sts->meta_data_size)
+                               ereport(ERROR,
+                                               (errcode_for_file_access(),
+                                                errmsg("could not read from 
temporary file: %m")));
+               }
+
+               /* Read the size. */
+               if (!eof)
+               {
+                       nread = BufFileRead(accessor->read_file, &tuple_size, 
sizeof(tuple_size));
+                       if (nread == 0)
+                               eof = true;

Why is it legal to have EOF here, if metadata previously didn't have an
EOF? Perhaps add an error if accessor->sts->meta_data_size != 0?


+               if (eof)
+               {
+                       participant->eof = true;
+                       if ((accessor->sts->flags & 
SHARED_TUPLESTORE_SINGLE_PASS) != 0)
+                               SharedBufFileDestroy(fileset, 
accessor->read_partition,
+                                                                        
accessor->read_participant);
+
+                       participant->error = false;
+                       LWLockRelease(&participant->lock);
+
+                       /* Move to next participant's file. */
+                       BufFileClose(accessor->read_file);
+                       accessor->read_file = NULL;
+                       continue;
+               }
+
+               /* Read the tuple. */
+               tuple = (MinimalTuple) palloc(tuple_size);
+               tuple->t_len = tuple_size;

Hm. Constantly re-allocing this doesn't strike me as a good idea (not to
mention that the API doesn't mention this is newly allocated).  Seems
like it'd be a better idea to have a per-accessor buffer where this can
be stored in - increased in size when necessary.


- Andres


-- 
Sent via pgsql-hackers mailing list ([email protected])
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] WIP: [[Parallel] Shared] Hash

Reply via email to