On 12/07/2015 09:25 PM, Josh Berkus wrote:
> On 12/07/2015 11:32 AM, Josh Berkus wrote:
>> On 12/07/2015 10:56 AM, Josh Berkus wrote:
>>> So, the prepare clone method above worked perfectly twice. But then we
>>> tried to bring up a new node as a prepared clone from node 11 and things
>>> went to hell.
>>
>> One thing I just realized was different between the first two,
>> successful, runs and the failed runs: the first two times, we didn't
>> have pg_hba.conf configured, so when we brought up slony on the new node
>> it couldn't connect until we fixed that.
>>
>> So I'm wondering if there's a timing issue here somewhere.
>
> So, this problem was less interesting than I thought. As it turns out,
> the sysadmin was handling "make sure slony doesn't start on the server"
> by letting it autostart, then shutting it down. In the couple minutes
> it was running, though, it did enough to prevent finish clone from working.
>

I wonder if there is more going on here.

In remoteWorker_event we have:

    if (node->last_event >= ev_seqno)
    {
        rtcfg_unlock();
        slon_log(SLON_DEBUG2,
                 "remoteWorker_event: event %d," INT64_FORMAT
                 " ignored - duplicate\n",
                 ev_origin, ev_seqno);
        return;
    }

    /*
     * We lock the worker threads message queue before bumping the nodes last
     * known event sequence to avoid that another listener queues a later
     * message before we can insert this one.
     */
    pthread_mutex_lock(&(node->message_lock));
    node->last_event = ev_seqno;
    rtcfg_unlock();

It seems strange to me that we obtain the mutex lock only after checking
node->last_event. Does the rtcfg_lock prevent the race condition, making
the direct message_lock redundant? If not, do we need to obtain
node->message_lock before we do the comparison?

The CLONE_NODE handler in remote_worker sets last_event by calling
rtcfg_getNodeLastEvent, which obtains the rtcfg_lock but not the message
lock.

The CLONE_NODE handler in remote_worker seems to do this:

1. calls rtcfg_storeNode (which obtains then releases the config lock)
2. calls cloneNodePrepare_int()
3. queries the last event id
4. calls rtcfg_getNodeLastEvent(), which would re-obtain then release
   the config lock

I wonder if, sometime after step 1 but before step 4, a remote listener
queries events from the new node and adds them to the queue because
last_event hasn't been set yet.

Maybe cloneNodePrepare needs to obtain the message queue lock at step 1
and hold it until step 4, and then remoteWorker_event needs to obtain
that lock a bit earlier.
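
To make the suggestion concrete, here is a rough sketch of the lock
ordering I have in mind. It is not slon code: SketchNode and the
sketch_* functions are hypothetical stand-ins that model only
message_lock and last_event, and the sketch leaves out rtcfg_lock
entirely, so it says nothing about lock ordering between the two locks.

```c
#include <assert.h>
#include <pthread.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-in for the per-node state in slon; only the
 * fields discussed above are modelled. */
typedef struct
{
    pthread_mutex_t message_lock;
    int64_t         last_event;
    int             queued;     /* events accepted into the queue */
} SketchNode;

static void
sketch_node_init(SketchNode *node)
{
    pthread_mutex_init(&node->message_lock, NULL);
    node->last_event = 0;
    node->queued = 0;
}

/* Proposed ordering for the remoteWorker_event path: take the message
 * lock first, then do the duplicate check and the bump under it. */
static bool
sketch_accept_event(SketchNode *node, int64_t ev_seqno)
{
    bool accepted = false;

    pthread_mutex_lock(&node->message_lock);
    if (node->last_event < ev_seqno)
    {
        node->last_event = ev_seqno;
        node->queued++;         /* stands in for inserting the message */
        accepted = true;
    }
    pthread_mutex_unlock(&node->message_lock);
    return accepted;
}

/* Proposed shape of the CLONE_NODE path: hold the message lock from
 * step 1 through step 4 so no listener can queue events in between. */
static void
sketch_clone_node(SketchNode *node, int64_t provider_last_event)
{
    assert(provider_last_event >= 0);

    pthread_mutex_lock(&node->message_lock);
    /* steps 1-3 (store node, prepare clone, query last event id)
     * would run here, with the lock held */
    node->last_event = provider_last_event;     /* step 4 */
    pthread_mutex_unlock(&node->message_lock);
}
```

With this ordering, a listener that calls sketch_accept_event while
sketch_clone_node is running blocks until last_event has been set, so
events at or below the cloned position are rejected as duplicates
instead of being queued. Whether holding message_lock that long is
acceptable in the real remote_worker is something I have not checked.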