On Mon, Sep 14, 2015 at 3:09 PM, Shulgin, Oleksandr <
oleksandr.shul...@zalando.de> wrote:

> On Mon, Sep 14, 2015 at 2:11 PM, Tomas Vondra <
> tomas.von...@2ndquadrant.com> wrote:
>
>>
>>> Now the backend that has been signaled on the second call to
>>> pg_cmdstatus (it can be either some other backend, or the backend B
>>> again) will not find an unprocessed slot, thus it will not try to
>>> attach/detach the queue and the backend A will block forever.
>>>
>>> This requires a really bad timing and the user should be able to
>>> interrupt the querying backend A still.
>>>
>>
>> I think we can't rely on the low probability that this won't happen, and
>> we should not rely on people interrupting the backend. Being able to detect
>> the situation and fail gracefully should be possible.
>>
>> It may be possible to introduce some lock-less protocol preventing such
>> situations, but it's not there at the moment. If you believe it's possible,
>> you need to explain and "prove" that it's actually safe.
>>
>> Otherwise we may need to introduce some basic locking - for example we
>> may introduce a LWLock for each slot, and lock it with dontWait=true (and
>> skip it if we couldn't lock it). This should prevent most scenarios where
>> one corrupted slot blocks many processes.
>
>
> OK, I will revisit this part then.
>

I have a radical proposal to remove the need for locking: make the
CmdStatusSlot struct consist of a mere dsm_handle and move all the required
metadata like sender_pid, request_type, etc. into the shared memory segment
itself.

If we allow only the requesting process to update the slot (that is, the
handle value itself), this removes the need for locking between sender and
receiver.
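
To make the proposed layout concrete, here is a rough sketch in standalone C.
It is not the patch itself: the field names sender_pid, request_type and
result_code come from the description above, while dsm_handle_t,
INVALID_DSM_HANDLE, CmdStatusRequestHeader and slot_in_use() are hypothetical
names used for illustration only.  It also assumes that sender_pid names the
backend expected to produce the response.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Stand-in for PostgreSQL's dsm_handle, which is a uint32. */
typedef uint32_t dsm_handle_t;

/* Hypothetical name: 0 marks a free slot, since 0 is never a valid handle. */
#define INVALID_DSM_HANDLE ((dsm_handle_t) 0)

/* The slot in the fixed shared array shrinks to just the handle. */
typedef struct CmdStatusSlot
{
    dsm_handle_t handle;
} CmdStatusSlot;

/* Hypothetical header placed at the start of the DSM segment itself. */
typedef struct CmdStatusRequestHeader
{
    int sender_pid;     /* backend expected to respond (pid_t in a backend) */
    int request_type;   /* what the requester is asking for */
    int result_code;    /* filled in by the sender before it replies */
    /* the message queue would follow the header in the same segment */
} CmdStatusRequestHeader;

/* True if the slot currently advertises a request. */
static int
slot_in_use(const CmdStatusSlot *slot)
{
    assert(slot != NULL);
    return slot->handle != INVALID_DSM_HANDLE;
}
```

With this split, everything the two processes need to coordinate on lives in
the segment, and the slot array only answers the question "is there a request,
and where is it?".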

The sender (the signaled backend that produces the response) will walk
through the slots looking for a non-zero dsm handle (according to the
dsm_create() implementation, 0 is considered an invalid handle), and if it
finds a valid one, it will attach and look inside to check whether the
request is destined for this process ID.  At first that might sound
strange, but I would expect that 99% of the time the only valid slot would
be for the process that has just been signaled.
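
The slot walk could look roughly like the sketch below.  To keep it runnable
outside the backend, attaching to a segment is replaced by a toy lookup table:
toy_attach(), segments[], RequestHeader, NUM_SLOTS and find_request_for() are
all made up for this example and do not correspond to real PostgreSQL
functions.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define NUM_SLOTS      4
#define INVALID_HANDLE 0u   /* 0 is never a valid handle */

typedef struct RequestHeader
{
    int sender_pid;     /* backend expected to respond */
    int request_type;
    int result_code;
} RequestHeader;

/* Toy replacement for DSM segments: handle h maps to segments[h]. */
static RequestHeader segments[NUM_SLOTS + 1];

/* Toy replacement for attaching to a segment by handle. */
static RequestHeader *
toy_attach(uint32_t handle)
{
    if (handle == INVALID_HANDLE || handle > NUM_SLOTS)
        return NULL;
    return &segments[handle];
}

/* Walk the slots and return the request addressed to my_pid, if any. */
static RequestHeader *
find_request_for(const uint32_t *slots, size_t nslots, int my_pid)
{
    for (size_t i = 0; i < nslots; i++)
    {
        uint32_t       handle = slots[i];  /* read the slot exactly once */
        RequestHeader *hdr;

        if (handle == INVALID_HANDLE)
            continue;                      /* free slot, nothing to do */

        hdr = toy_attach(handle);
        if (hdr != NULL && hdr->sender_pid == my_pid)
            return hdr;                    /* this request is for us */

        /* a real backend would detach from the foreign segment here */
    }
    return NULL;
}
```

Note that the sender never writes to the slot array in this loop; it only
reads handles and then works inside the segment it attached to.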

The sender process will then calculate the response message, update the
result_code in the shared memory segment and finally send the message
through the queue.  If the receiver has since detached we get a detached
result code and bail out.

Clearing the slot after receiving the message should be the requesting
process' responsibility.  This way the receiver only writes to the slot and
the sender only reads from it.

By the way, is it safe to assume atomic read/writes of dsm_handle
(uint32)?  I would be surprised if not.
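
If we would rather not depend on that assumption, the slot could be accessed
through explicit atomic operations.  The sketch below uses C11 <stdatomic.h>
only to stay standalone; inside the backend the atomics layer in
port/atomics.h would presumably be the natural choice.  AtomicSlot,
slot_publish(), slot_clear() and slot_peek() are hypothetical names.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdint.h>

/* A slot whose only content is the handle, accessed atomically. */
typedef struct AtomicSlot
{
    _Atomic uint32_t handle;    /* 0 means "no request" */
} AtomicSlot;

/* Requester side: advertise a request by storing a valid handle. */
static void
slot_publish(AtomicSlot *slot, uint32_t handle)
{
    assert(handle != 0);        /* 0 is reserved for the free state */
    atomic_store(&slot->handle, handle);
}

/* Requester side: clear the slot once the response has been received. */
static void
slot_clear(AtomicSlot *slot)
{
    atomic_store(&slot->handle, 0);
}

/* Sender side: read-only access, never modifies the slot. */
static uint32_t
slot_peek(AtomicSlot *slot)
{
    return atomic_load(&slot->handle);
}
```

This keeps the single-writer rule visible in the code: only the two
requester-side functions store to the slot.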

--
Alex
