Very sorry for the late response, Jon.

On 12/12/2017 12:52 AM, Jon Maloy wrote:
>>>                 flush_work(&con->swork);
>>>                 flush_work(&con->rwork);
>> Sorry, we cannot move flush works to tipc_conn_kref_release(), otherwise,
>> that name table soft lockup solved by 9dc3abdd1f7e ("tipc:
>> fix nametbl_lock soft lockup at module exit") may occur again because
>> tipc_conn_kref_release may be called in tipc_nametbl_withdraw() rather
>> than tipc_topsrv_stop() context.
> It is both possible and necessary, as I see it. The key to achieve this is to 
> move all deletes, i.e., all calls to xxx_put() of any kind, to the receiving 
> context of either work queue. Never in an interrupt context, and never from 
> the event trigging side of the send work queue.

In fact, if we impose so many restrictions on where con_put() is allowed
to release the "con" object, the kref field declared in the con
structure loses its real purpose.

The current design model of the TIPC internal topology server follows
the design philosophy of many other internal servers implemented in
other subsystems, such as nfs, drbd, dlm and ocfs2. In other words, the
design is not a new invention.

In this design model, every event/connection request is handled
asynchronously. Before we use an object, we must first increase its
refcount, and we must decrease it once we no longer use the object. But
no matter which context we are in, we may drop the refcount anywhere.

Although the situation in our TIPC module might be a bit more complex
than in other components, I don't think it needs a radical overhaul. In
my opinion, no other potential issue remains in the topology server
once the two patches are applied.

Of course, when the TIPC server was designed, there were two internal
servers: the topology server and the remote configuration server.
That's why we abstracted out one common server, with the two TIPC
internal servers just implementing the private interfaces it exposed.
Now that the remote configuration server has been deleted, we can
remove some redundant code from the TIPC server. But if one day we need
to implement another special internal server, we may have to return to
the current design.

Lastly, we have fixed several critical issues in the TIPC server many
times before, but after careful analysis, I believe those fixes did not
completely address the real root causes.

In your solution, we significantly expand the sk->sk_callback_lock
protection scope and introduce another new lock, sub_lock, while the
RCU lock becomes almost meaningless. However, although RCU is a bit
more complex to use than an RW lock, it has two significant advantages:
- Good read-side performance.
- Almost complete immunity to deadlock.

We once spent much time removing several RW locks designed into TIPC
1.7.7 and older versions, but now it seems more and more RW/spin locks
are being introduced again. This is not a good trend.

All in all, I still think the current locking policy of the TIPC server
has no problem in principle, so it's better that we follow the current
design model when improving the TIPC server code.

If you agree with my idea above, I will merge the two patches into one,
add the suggestions you commented on in patch #1, and submit a new
version to this mailing list. If the patch is fine, I will submit it to
net-next as soon as possible. After that, we can continue to further
simplify the TIPC server code with the patches of your series
("tipc: simplify topology server").

Thanks,
Ying

> If you look at my patch #4(v3) that I sent out two weeks ago I do exactly 
> this, but for a different reason; because this might happen even at a 
> subscription timeout event, although with less fatal consequences.
> 
> You don't have to eliminate the tipc_subscriber structure to achieve this, as 
> I did, (although I strongly recommend it; but that can be done later). You 
> only need to send some hint along with the subscription event that this is 
> the final event for this subscription, and that it should be deleted, in work 
> queue context, after the event has been posted on the socket (or dropped).
> 
>> Moreover, before a work item is queued into workqueue, con refcnt needs
>> to be first held. So when tipc_conn_kref_release() is called, it means there 
>> is
>> no any user for a con. As a consequence, there is no any work item which are
>> still queued in workqueues, so we don't need to flush works at all.
> Even better. So, we never need to flush any work queues, only the send event 
> queue.
> 
> Regards
> ///jon
> 

_______________________________________________
tipc-discussion mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/tipc-discussion