On Wed, Apr 5, 2017 at 3:07 PM, Tomas Vondra <tomas.von...@2ndquadrant.com> wrote: > > > On 04/05/2017 09:05 AM, Kuntal Ghosh wrote: >> >> AFAICU, during crash recovery, we wait for all non-syslogger children >> to exit, then reset shmem(call BackgroundWorkerShmemInit) and perform >> StartupDataBase. While starting the startup process we check if any >> bgworker is scheduled for a restart. If a bgworker is crashed and not >> meant for a restart(parallel worker in our case), we call >> ForgetBackgroundWorker() to discard it. >> > > OK, so essentially we reset the counters, getting > > parallel_register_count = 0 > parallel_terminate_count = 0 > > and then ForgetBackgroundWworker() runs and increments the terminate_count, > breaking the logic? > > And you're using (parallel_register_count > 0) to detect whether we're still > in the init phase or not. Hmmm. > Yes. But, as Robert suggested up in the thread, we should not use (parallel_register_count = 0) as an alternative to define a bgworker crash. Hence, I've added an argument named 'wasCrashed' in ForgetBackgroundWorker to indicate a bgworker crash.
>>> >>> In any case, the comment right before BackgroundWorkerArray says this: >>> >>> * These counters can of course overflow, but it's not important here >>> * since the subtraction will still give the right number. >>> >>> which means that this assert >>> >>> + Assert(BackgroundWorkerData->parallel_register_count >= >>> + BackgroundWorkerData->parallel_terminate_count); >>> >>> is outright broken, just like any other attempts to rely on simple >>> comparisons of these two counters, no? >>> >> IIUC, the assert statements ensures that register count should always >> be greater than or equal to the terminate count. >> Whereas, the comment refers to the fact that register count and >> terminate count indicate the total number of parallel workers spawned >> and terminated respectively since the server has been started. At >> every session, we're not resetting the counts, hence they can >> overflow. But, their substraction should give you the number of active >> parallel worker at any instance. >> So, I'm not able to see how the assert is broken w.r.t the aforesaid >> comment. Am I missing something here? >> > > The comment says that the counters are allowed to overflow, i.e. after a > long uptime you might get these values > > parallel_register_count = UINT_MAX + 1 = 1 > parallel_terminate_count = UINT_MAX > > which is fine, because the C handles the overflow during subtraction and so > > parallel_register_count - parallel_terminate_count = 1 > > But the assert is not doing subtraction, it's comparing the values directly: > > Assert(parallel_register_count >= parallel_terminate_count); > > and the (perfectly valid) values trivially violate this comparison. > Thanks for the explanation. So, we can't use the above assert statement. Even the following assert statement will not be helpful: Assert(parallel_register_count - parallel_terminate_count >= 0); Because, it'll fail to track the scenario when parallel_register_count is not overflowed, still less than parallel_terminate_count. :( It seems to me the only thing we can make sure is (parallel_register_count == parallel_terminate_count == 0) in ForgetBackgroundWorker in case of a crash. -- Thanks & Regards, Kuntal Ghosh EnterpriseDB: http://www.enterprisedb.com -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers