In your last email you said it reproduced. Now you say it did not. I'm confused.
On 10/12/2011 07:13 PM, Wengang Wang wrote:
> On 11-10-12 19:11, Sunil Mushran wrote:
>> That's what ovm does. Have you reproduced it with the ovm3 kernel?
>>
> No, I have no reproductions.
>
> thanks,
> wengang.
>> On 10/12/2011 07:07 PM, Wengang Wang wrote:
>>> On 11-10-13 09:51, Wengang Wang wrote:
>>>> On 11-10-12 18:47, Sunil Mushran wrote:
>>>>> I meant master_request (not query). We set refmap _before_
>>>>> asserting. So that should not happen.
>>>> Why can't the remote node have requested a deref (DLM_DEREF_LOCKRES_MSG)?
>>> The problem can easily happen with this dlmfs usage:
>>>
>>> reopen:
>>> open(create) /dlm/dirxx/filexx
>>> close /dlm/dirxx/filexx
>>> sleep 60
>>> goto reopen
>>>
>>>> thanks,
>>>> wengang.
>>>>> On 10/12/2011 06:02 PM, Wengang Wang wrote:
>>>>>> Hi Sunil,
>>>>>>
>>>>>> On 11-10-12 17:32, Sunil Mushran wrote:
>>>>>>> So you are saying a lockres can get purged before the node is asserting
>>>>>>> master to other nodes?
>>>>>>>
>>>>>>> The main place where we dispatch assert is during master_query.
>>>>>>> There we set refmap before dispatching. Meaning refmap will protect
>>>>>>> us from purging.
>>>>>>>
>>>>>>> But I think it could happen in master_requery, which only comes into
>>>>>>> play if a node dies during migration.
>>>>>>>
>>>>>>> Is that the case here?
>>>>>> I think this can mainly include the response for a master_request.
>>>>>> In dlm_master_request_handler(), the master node queues assert_master.
>>>>>> The node which sent the master_request learns the master from the
>>>>>> response values. It doesn't need to wait until the assert_master comes.
>>>>>> As you know, the asserting master work is done in a workqueue, and the
>>>>>> work item in it can be heavily delayed. So in the duration from the
>>>>>> (old) master responding with "Yes, I am master" to it sending
>>>>>> assert_master, anything can happen. The worst case is that the lockres
>>>>>> on the (old) master gets purged and is remastered by another node. So in
>>>>>> this case, apparently, the old master shouldn't send the assert_master
>>>>>> any longer. To prevent that from happening, we should keep the lockres
>>>>>> un-purged as long as it's queued for master_request.
>>>>>>
>>>>>> #the problem is what my flush_workqueue patch tries to fix.
>>>>>>
>>>>>> thanks,
>>>>>> wengang.
>>>>>>
>>>>>>> On 10/12/2011 12:04 AM, Wengang Wang wrote:
>>>>>>>> Hi Sunil/Joel/Mark and anyone who has interest,
>>>>>>>>
>>>>>>>> This is not a patch but a discussion.
>>>>>>>>
>>>>>>>> Currently we have a problem:
>>>>>>>> When a lockres is still queued (in dlm->work_list) for sending an
>>>>>>>> assert_master (or is in the process of sending), the lockres can't be
>>>>>>>> purged (removed from hash). There is no flag/state on the lockres
>>>>>>>> itself that denotes this situation.
>>>>>>>>
>>>>>>>> The badness is that if the lockres is purged (surely not the owner at
>>>>>>>> the moment) and the assert_master comes after the purge, it can
>>>>>>>> confuse other nodes. On another node, the owner can now be any other
>>>>>>>> node, thus on receiving the assert_master, it can trigger a BUG()
>>>>>>>> because 'owner' doesn't match.
>>>>>>>>
>>>>>>>> So we'd better prevent the lockres from being purged when it's queued
>>>>>>>> for something (assert_master).
>>>>>>>>
>>>>>>>> Srini and I discussed some possible fixes:
>>>>>>>> 1) adding a flag to lockres->state.
>>>>>>>>    This does not work. A lockres can have multiple instances in the
>>>>>>>>    queue list. A simple flag is not safe. And the instances are not
>>>>>>>>    nested, so even saving the previous flags doesn't work. Neither can
>>>>>>>>    we merge the instances, because they can be for different purposes.
>>>>>>>>
>>>>>>>> 2) checking if the lockres is queued before purging it.
>>>>>>>>    This works, but doesn't sound good. It needs changes to the current
>>>>>>>>    behaviour of the queue list. Also, we have no idea about the
>>>>>>>>    performance of the check (searching the list).
>>>>>>>>
>>>>>>>> 3) making use of lockres->inflight_locks.
>>>>>>>>    This works, but seems to be a misuse of inflight_locks.
>>>>>>>>
>>>>>>>> 4) adding a new member to lockres counting the number of times it is
>>>>>>>>    queued.
>>>>>>>>    This works and is simple, but needs extra memory.
>>>>>>>>
>>>>>>>> I prefer 4).
>>>>>>>>
>>>>>>>> What's your idea?
>>>>>>>>
>>>>>>>> thanks,
>>>>>>>> wengang.
>>>>>>>>
>>>>>>>> _______________________________________________
>>>>>>>> Ocfs2-devel mailing list
>>>>>>>> Ocfs2-devel@oss.oracle.com
>>>>>>>> http://oss.oracle.com/mailman/listinfo/ocfs2-devel
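
For reference, option 4) from the original proposal (a per-lockres count of
queued work items) can be modeled in a few lines of userspace C. This is only
a sketch under assumptions: struct lockres_model, the queued_work member, and
the three helpers are hypothetical names made up for illustration, not the
ocfs2 DLM code. A real change would presumably update the counter under the
lockres spinlock, which is left out here.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical, simplified stand-in for a DLM lock resource. */
struct lockres_model {
    unsigned int queued_work; /* hypothetical: work items still queued */
    bool in_hash;             /* true until the lockres is purged */
};

/* Call when a work item (e.g. an assert_master) is queued for res. */
static void lockres_work_queued(struct lockres_model *res)
{
    res->queued_work++;
}

/* Call when the worker has finished processing one queued item. */
static void lockres_work_done(struct lockres_model *res)
{
    assert(res->queued_work > 0); /* unbalanced "done" is a bug */
    res->queued_work--;
}

/*
 * Purge only when nothing is queued. Returns true if the lockres was
 * purged, false if the purge had to be skipped.
 */
static bool lockres_try_purge(struct lockres_model *res)
{
    if (res->queued_work != 0)
        return false;
    res->in_hash = false;
    return true;
}
```

Because this is a count and not a flag, two independent queued instances of
the same lockres do not clobber each other: the purge stays blocked until
both have been processed, which is the case option 1) could not handle.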