On 2017-03-23 20:35:09 +1300, Thomas Munro wrote:
> Here is a new patch series responding to feedback from Peter and Andres:
+
+/* Per-participant shared state. */
+typedef struct SharedTuplestoreParticipant
+{
+	LWLock lock;

Hm. No padding (ala LWLockMinimallyPadded / LWLockPadded) - but that's
probably ok, for now.

+	bool error;					/* Error occurred flag. */
+	bool eof;					/* End of file reached. */
+	int read_fileno;			/* BufFile segment file number. */
+	off_t read_offset;			/* Offset within segment file. */

Hm. I wonder if it'd not be better to work with 64bit offsets, and just
separate that out upon segment access.

+/* The main data structure in shared memory. */

"main data structure" isn't particularly meaningful.

+struct SharedTuplestore
+{
+	int reading_partition;
+	int nparticipants;
+	int flags;

Maybe add a comment saying /* flag bits from SHARED_TUPLESTORE_* */?

+	Size meta_data_size;

What's this?

+	SharedTuplestoreParticipant participants[FLEXIBLE_ARRAY_MEMBER];

I'd add a comment here that there's further data after participants.

+};
+
+/* Per-participant backend-private state. */
+struct SharedTuplestoreAccessor
+{

Hm. The name and it being backend-local are a bit conflicting.

+	int participant;			/* My participant number. */
+	SharedTuplestore *sts;		/* The shared state. */
+	int nfiles;					/* Size of local files array. */
+	BufFile **files;			/* Files we have open locally for writing. */

Shouldn't this mention that it's indexed by partition?

+	BufFile *read_file;			/* The current file to read from. */
+	int read_partition;			/* The current partition to read from. */
+	int read_participant;		/* The current participant to read from. */
+	int read_fileno;			/* BufFile segment file number. */
+	off_t read_offset;			/* Offset within segment file. */
+};

+/*
+ * Initialize a SharedTuplestore in existing shared memory.  There must be
+ * space for sts_size(participants) bytes.  If flags is set to the value
+ * SHARED_TUPLESTORE_SINGLE_PASS then each partition may only be read once,
+ * because underlying files will be deleted.

Any reason not to use flags that are compatible with tuplestore.c?
+ * Tuples that are stored may optionally carry a piece of fixed-size
+ * meta-data which will be retrieved along with the tuple.  This is useful
+ * for the hash codes used for multi-batch hash joins, but could have
+ * other applications.
+ */
+SharedTuplestoreAccessor *
+sts_initialize(SharedTuplestore *sts, int participants,
+			   int my_participant_number,
+			   Size meta_data_size,
+			   int flags,
+			   dsm_segment *segment)
+{

Not sure I like that the naming here has little in common with
tuplestore.h's API.

+
+MinimalTuple
+sts_gettuple(SharedTuplestoreAccessor *accessor, void *meta_data)
+{

This needs docs.

+	SharedBufFileSet *fileset = GetSharedBufFileSet(accessor->sts);
+	MinimalTuple tuple = NULL;
+
+	for (;;)
+	{
...
+		/* Check if this participant's file has already been entirely read. */
+		if (participant->eof)
+		{
+			BufFileClose(accessor->read_file);
+			accessor->read_file = NULL;
+			LWLockRelease(&participant->lock);
+			continue;

Why are we closing the file while holding the lock?

+
+		/* Read the optional meta-data. */
+		eof = false;
+		if (accessor->sts->meta_data_size > 0)
+		{
+			nread = BufFileRead(accessor->read_file, meta_data,
+								accessor->sts->meta_data_size);
+			if (nread == 0)
+				eof = true;
+			else if (nread != accessor->sts->meta_data_size)
+				ereport(ERROR,
+						(errcode_for_file_access(),
+						 errmsg("could not read from temporary file: %m")));
+		}
+
+		/* Read the size. */
+		if (!eof)
+		{
+			nread = BufFileRead(accessor->read_file, &tuple_size,
+								sizeof(tuple_size));
+			if (nread == 0)
+				eof = true;

Why is it legal to hit EOF here, if the metadata read just before didn't
hit EOF? Perhaps add an error if accessor->sts->meta_data_size != 0?

+		if (eof)
+		{
+			participant->eof = true;
+			if ((accessor->sts->flags & SHARED_TUPLESTORE_SINGLE_PASS) != 0)
+				SharedBufFileDestroy(fileset, accessor->read_partition,
+									 accessor->read_participant);
+
+			participant->error = false;
+			LWLockRelease(&participant->lock);
+
+			/* Move to next participant's file.
+			 */
+			BufFileClose(accessor->read_file);
+			accessor->read_file = NULL;
+			continue;
+		}
+
+		/* Read the tuple. */
+		tuple = (MinimalTuple) palloc(tuple_size);
+		tuple->t_len = tuple_size;

Hm. Constantly re-allocing this doesn't strike me as a good idea (not to
mention that the API doesn't say this is newly allocated). Seems like it'd
be a better idea to have a per-accessor buffer where this can be stored,
increased in size when necessary.

- Andres


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers