Ying, do you have any comment on this? 

///jon


> -----Original Message-----
> From: Jon Maloy
> Sent: Monday, December 11, 2017 12:14
> To: Jon Maloy <[email protected]>; Ying Xue
> <[email protected]>; [email protected]
> Cc: [email protected]
> Subject: RE: [net-next PATCH 1/2] tipc: prevent swork item from queuing
> workqueue after connection is closed
> 
> Thinking more about this, I am puzzled how the soft lockup fixed by
> 9dc3abdd1f7e ("tipc: fix nametbl_lock soft lockup at module exit") could
> happen at all.
> When nametbl_stop() is run, after topsrv_stop(), all existing subscriptions
> are supposed to have been cleaned out already, and a call chain like the one
> described should not be possible.
> I think this is another issue we need to look into.
> 
> ///jon
> 
> 
> > -----Original Message-----
> > From: Jon Maloy [mailto:[email protected]]
> > Sent: Monday, December 11, 2017 11:52
> > To: Ying Xue <[email protected]>; [email protected]
> > Cc: [email protected]
> > Subject: Re: [tipc-discussion] [net-next PATCH 1/2] tipc: prevent
> > swork item from queuing workqueue after connection is closed
> >
> >
> >
> > > -----Original Message-----
> > > From: Ying Xue [mailto:[email protected]]
> > > Sent: Monday, December 11, 2017 06:57
> > > To: Jon Maloy <[email protected]>; [email protected]
> > > Cc: [email protected]
> > > Subject: Re: [net-next PATCH 1/2] tipc: prevent swork item from
> > > queuing workqueue after connection is closed
> > >
> > > On 12/06/2017 09:03 PM, Jon Maloy wrote:
> > > >
> > > >
> > > >> -----Original Message-----
> > > >> From: Ying Xue [mailto:[email protected]]
> > > >> Sent: Wednesday, December 06, 2017 03:29
> > > >> To: Jon Maloy <[email protected]>; [email protected]
> > > >> Cc: [email protected]
> > > >> Subject: Re: [net-next PATCH 1/2] tipc: prevent swork item from
> > > >> queuing workqueue after connection is closed
> > > >>
> > > >>
> > > >>
> > > >> On 12/06/2017 03:31 AM, Jon Maloy wrote:
> > > >>>> If you remove the connection item from the idr list *before*
> > > >>>> you flush the queues and remove the socket, there is no risk
> > > >>>> that this will happen.
> > > >>>> If tipc_conn_sendmsg() obtains a pointer, we can be certain
> > > >>>> that the queue has not been flushed yet, and that this will
> > > >>>> not happen until conn_put() is called.
> > > >>>> And, you won't need to use the callback lock or touch the
> > > >>>> socket at all here.
> > > >>> It's even better: if you introduce the
> > > >>> !kref_get_unless_zero(&con->kref) test, there is no way
> > > >>> tipc_conn_sendmsg() will obtain a pointer while
> > > >>> tipc_conn_kref_release() is being executed, i.e., after the
> > > >>> queue is flushed.
> > > >>> I suggest you make both these changes.
> > > >>
> > > >>
> > > >> Please let me use below call chains to show what the race condition is:
> > > >>
> > > >>   CPU 1:                                 CPU 2:
> > > >> ==========                             ===============
> > > >> T1: tipc_conn_sendmsg()
> > > >> T2:  con = tipc_conn_lookup()
> > > >> T3:  test_bit(CF_CONNECTED, &con->flags)
> > > >> T4:   =====> interrupted by another thread
> > > >> T5:                                     tipc_close_conn()
> > > >> T6:                                     test_and_clear_bit(CF_CONNECTED, &con->flags)
> > > >> T7:                                     tipc_unregister_callbacks()
> > > >> T8:                                     flush_work(&con->swork)
> > > >> T9:                                     flush_work(&con->rwork)
> > > >> T10: queue_work(s->send_wq, &con->swork)
> > > >>
> > > >>
> > > >> The key point is that the whole process of identifying whether
> > > >> the conn is closed (or being closed) with test_bit(CF_CONNECTED,
> > > >> &con->flags) or sk->sk_user_data and then queuing the swork item
> > > >> into the workqueue is not atomic.
> > > >
> > > > That is true. But looking at the kref value is, because if that
> > > > value is zero nobody else has a pointer to the connection item.
> > >
> > > Yes, but that's unrelated to this case.
> > >
> > > > And vice versa, if it is non-zero the queues cannot have been
> > > > flushed yet.
> > >
> > > In that case, at T4, for example, we have used
> > > kref_get_unless_zero(&con->kref) to confirm that the con refcnt is
> > > really not zero, so the check tells us that we can queue an swork
> > > item into the workqueue. But at that moment tipc_conn_sendmsg() is
> > > interrupted. At T5, tipc_close_conn() is called on CPU 2: the con
> > > CF_CONNECTED flag is cleared, sk_user_data is set to NULL, and the
> > > works are subsequently flushed on that CPU.
> >
> > I was of course assuming that the queues are flushed in
> > tipc_conn_kref_release() as I suggested. Then this wouldn't be a problem.
> >
> > >
> > > When tipc_conn_sendmsg() comes back at T10, as it is totally
> > > unaware that the CF_CONNECTED flag in con->flags has been cleared
> > > and the works have been flushed, an swork item will still be queued
> > > to a flushed workqueue. As a result, con may finally be destroyed
> > > in the context of calling
> >
> > [...]
> >
> > > >                 flush_work(&con->swork);
> > > >                 flush_work(&con->rwork);
> > >
> > > Sorry, we cannot move the work flushing to
> > > tipc_conn_kref_release(); otherwise, the name table soft lockup
> > > fixed by 9dc3abdd1f7e ("tipc: fix nametbl_lock soft lockup at
> > > module exit") may occur again, because tipc_conn_kref_release()
> > > may be called from tipc_nametbl_withdraw() context rather than
> > > from tipc_topsrv_stop().
> >
> > It is both possible and necessary, as I see it. The key to achieving
> > this is to move all deletes, i.e., all calls to xxx_put() of any
> > kind, to the receiving context of either work queue. Never in an
> > interrupt context, and never from the event-triggering side of the
> > send work queue.
> > If you look at my patch #4 (v3) that I sent out two weeks ago, I do
> > exactly this, but for a different reason; because this might happen
> > even at a subscription timeout event, although with less fatal
> > consequences.
> >
> > You don't have to eliminate the tipc_subscriber structure to achieve
> > this, as I did (although I strongly recommend it; that can be done
> > later). You only need to send some hint along with the subscription
> > event that this is the final event for this subscription, and that
> > it should be deleted, in work queue context, after the event has
> > been posted on the socket (or dropped).
> >
> > >
> > > Moreover, before a work item is queued into a workqueue, the con
> > > refcnt needs to be held first. So when tipc_conn_kref_release() is
> > > called, it means there is no user of the con any more. As a
> > > consequence, there is no work item still queued in the workqueues,
> > > so we don't need to flush the works at all.
> >
> > Even better. So, we never need to flush any work queues, only the send
> > event queue.
> >
> > Regards
> > ///jon
> >
> > >
> > > >                 saddr->scope = -TIPC_NODE_SCOPE;
> > > >                 kernel_bind(sock, (struct sockaddr *)saddr,
> > > >                             sizeof(*saddr));
> > > >                 if (con->conid)
> > > >                         s->tipc_conn_release(con->conid, con->usr_data);
> > > >                 sock_release(sock);
> > > >                 con->sock = NULL;
> > > >         }
> > > >         tipc_clean_outqueues(con);
> > > >         kfree(con);
> > > > }
> > > >
> > > > ///jon
> > > >
> > > >> Even if all the checks of the connection status are done
> > > >> correctly and the connection is still connected before the
> > > >> swork item is queued, if the thread running tipc_conn_sendmsg()
> > > >> is interrupted at T4 and resumes at T10, an swork item might be
> > > >> queued into the workqueue after the queue has been flushed in
> > > >> tipc_close_conn() on CPU 2.
> > > >>
> > > >> That's why we need to use the sk_callback_lock read lock to
> > > >> guarantee that the entire process of identifying the connection
> > > >> status and queuing the swork item cannot be interrupted.
> > > >>
> > > >>>
> > > >>> ///jon
> > > >>>
> > ----------------------------------------------------------------------
> > -------- Check out the vibrant tech community on one of the world's
> > most engaging tech sites, Slashdot.org! http://sdm.link/slashdot
> > _______________________________________________
> > tipc-discussion mailing list
> > [email protected]
> > https://lists.sourceforge.net/lists/listinfo/tipc-discussion
