Re: [HACKERS] strange parallel query behavior after OOM crashes

Amit Kapila Wed, 05 Apr 2017 06:20:13 -0700

On Wed, Apr 5, 2017 at 12:35 PM, Kuntal Ghosh
<[email protected]> wrote:
> On Tue, Apr 4, 2017 at 11:22 PM, Tomas Vondra
> <[email protected]> wrote:
>> On 04/04/2017 06:52 PM, Robert Haas wrote:
>>>
>>> On Mon, Apr 3, 2017 at 6:08 AM, Kuntal Ghosh <[email protected]>
>>> wrote:
>>>>
>>>> On Fri, Mar 31, 2017 at 6:50 PM, Robert Haas <[email protected]>
>>>> wrote:
>>>>>
>>>>> On Thu, Mar 30, 2017 at 4:35 PM, Kuntal Ghosh
>>>>> <[email protected]> wrote:
>>>>>>
>>>>>> 2. the server restarts automatically and initializes
>>>>>> BackgroundWorkerData->parallel_register_count and
>>>>>> BackgroundWorkerData->parallel_terminate_count in shared memory.
>>>>>> After that, it calls ForgetBackgroundWorker, which increments
>>>>>> parallel_terminate_count.
>>>>>
>>>>>
>>>>> Hmm.  So this seems like the root of the problem.  Presumably those
>>>>> things need to be reset AFTER forgetting any background workers from
>>>>> before the crash.
>>>>>
>>>> IMHO, the fix would be not to increase the terminated parallel worker
>>>> count whenever ForgetBackgroundWorker is called due to a bgworker
>>>> crash. I've attached a patch for the same. PFA.
>>>
>>>
>>> While I'm not opposed to that approach, I don't think this is a good
>>> way to implement it.  If you want to pass an explicit flag to
>>> ForgetBackgroundWorker telling it whether or not it should perform
>>> the increment, fine.  But with what you've got here, you're
>>> essentially relying on "spooky action at a distance".  It would be
>>> easy for future code changes to break this, not realizing that
>>> somebody's got a hard dependency on 0 having a specific meaning.
>>>
>>
>> I'm probably missing something, but I don't quite understand how these
>> values actually survive the crash. I mean, what I observed is OOM followed
>> by a restart, so shouldn't BackgroundWorkerShmemInit() simply reset the
>> values back to 0? Or do we call ForgetBackgroundWorker() after the crash for
>> some reason?
> AFAICU, during crash recovery, we wait for all non-syslogger children
> to exit, then reset shmem (calling BackgroundWorkerShmemInit) and perform
> StartupDataBase. While starting the startup process, we check whether any
> bgworker is scheduled for a restart.
>


In general, your theory appears right, but can you check how this
behaves on a standby server?  The startup process behaves differently
on a master and on a standby: on the master it exits once recovery
finishes, whereas on a standby it keeps running to receive WAL.
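
To make the ordering problem discussed upthread concrete, here is a
rough toy model in C.  It is not the bgworker.c code: the struct, the
toy_* function names and the count_termination flag are made up for
illustration, and it assumes the two counters are unsigned 32-bit values
whose difference is compared against max_parallel_workers.  Under those
assumptions, an increment of parallel_terminate_count that happens after
the reset makes the difference wrap around, so every later registration
looks like it is over the limit.

```c
#include <stdbool.h>
#include <stdint.h>

/* Toy stand-in for the two shared counters; not PostgreSQL's real struct. */
typedef struct ToyWorkerData
{
    uint32_t parallel_register_count;
    uint32_t parallel_terminate_count;
} ToyWorkerData;

/* Models the shared-memory reset done on restart after a crash. */
void
toy_shmem_init(ToyWorkerData *d)
{
    d->parallel_register_count = 0;
    d->parallel_terminate_count = 0;
}

/*
 * Models forgetting a worker.  count_termination is a hypothetical
 * explicit flag, in the spirit of the suggestion quoted above: the
 * caller says whether the terminate counter should be bumped.
 */
void
toy_forget_worker(ToyWorkerData *d, bool count_termination)
{
    if (count_termination)
        d->parallel_terminate_count++;
}

/* Active workers = registered - terminated, in unsigned arithmetic. */
uint32_t
toy_active_workers(const ToyWorkerData *d)
{
    return d->parallel_register_count - d->parallel_terminate_count;
}

/* A new parallel worker is allowed only while we are below the limit. */
bool
toy_can_register(const ToyWorkerData *d, uint32_t max_parallel_workers)
{
    return toy_active_workers(d) < max_parallel_workers;
}
```

In this model, calling toy_shmem_init() and then
toy_forget_worker(&d, true) leaves toy_active_workers() at UINT32_MAX,
so toy_can_register() returns false for any realistic limit.  Passing
false for a worker left over from before the crash keeps the counters
at zero.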

-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com


