Re: ABBA deadlock

Peng Zheng Fri, 25 Feb 2022 03:00:47 -0800

On 2/25/22 17:54, Pepijn Noltes wrote:

Hi Peng,
To start off, I see that some emails are directly mailed to me. Ithink for future mails these discussions can go to the celix devmailing list.


Oops, I kept ctrl+shift+Ring without noticing this ;p

Maybe it is good to explain some history about the Celix event threadand async api.
Celix started with implementing the OSGi api from the specification asdirectly as possible.
But this gave some issues:
1) No real protected of services going away and their memory beingdeallocated or even their bundle libraries being "dlclosed".This led to the introduction of the "use services" approach instead ofthe Java BundleContext.getServiceReference and the hiding of servicereferences for the user.
2) C (pthread) mutexes are not the same as java synchronized
Essentially (if I am correct) Java synchronized works as a recursivemutex. For Celix we did not want to promote the use of recursivemutexes (the whole recursive mutexes are evil discussion).But using normal mutexes in combination with service trackers whichtriggers service usage in their "add" callbacks quickly lead to deadlocks.This is especially true for something like the remote services. e.g.in the same thread: remote service found (RSA mutex) -> tracker addcallback called for component -> component registered remote services-> new remote service added in RSA tracker callback (RSA mutex) ->deadlock.
Note that for Java OSGi it is rarer to have issues with this, but theystill occur (murphy's law). Mostly as a current modificationexception, because synchronized (recursive mutexes) do not protect theexpected invariants.
This led to the introduction of handling service registration,creating service trackers and using services one Celix event threadand an async api next to the sync api.In my experience this solved quite some issues we had, but is also notperfect.Especially because a combination of sync/async api is now possible,this can lead to "solving" a deadlock issue in one spot only to haveit pop up again a bit further.
And to be honest I do not know how we can tackle this in a better way.
The sync api is sometimes needed, for example if a component loses 1of its required dependencies it needs to stop (remove its ownservices) before the required dependency service can finish removing,this requires a sync unregister service call. But to prevent deadlockissues for the user, ideally all service registration, serviceunregistration events are done async.
I do have 1 though: Create a Celix event thread pool (instead of asingle thread), which can spin up new threads if needed to preventevents being blocked by other events.But I am not sure if this really solves the issue and whether it isfeasible to detect if an event needs to be handled on another thread. It also has an impact on the footprint of Celix.
In short I think the whole 1 Celix event thread, combination ofsync/async should be kept as-is and released for now. But I think itwould be wise to rethink the future of Celix and maybe try thinkingabout a Celix 3 release with a major update/refactoring of how MTissues are handled.


The event loop design is elegant, and solves a lot of problems.

What I'm really against is the behavior ofcelix_bundleContext_useServiceWithOptions.IMHO, the event loop is a very precious resource, which should only runnon-blocking fast administrative tasks, i.e. the event loop should befor framework usage only.I also notices bundle install/uninstall is already offload to separatethread, which is really nice. This is quite similar to libuv's treatmentof file operations (throwing them into thread pool).For high level application development, services often do things likeperforming network/file IO, querying database, CPU-intensive tasks, toname a few.These rarely have to mess with the event loop. Calling them on the eventloop should be avoided at all costs IMO.However, as an easy-to-use API, the current default behavior ofcelix_bundleContext_useServiceWithOptions does encourage its users to dosuch things.



On Wed, Feb 23, 2022 at 2:00 PM Peng Zheng <zheng_p...@qq.com> wrote:

    On 2/23/22 19:13, Peng Zheng wrote:

    On 2/23/22 18:41, Peng Zheng wrote:


    psa->topicSenders.mutex is another example of external tracker
    lock, which is locked by pubsub_zmqAdmin_removeProtocolSvc.
    Noting that bundle is always stopped by a non-event-loop
    thread, the following call chain will also lead to dead lock by
    the observed pattern.
    psa_zmq_stop->pubsub_zmqAdmin_destroy(holing
    
psa->topicSenders.mutex)->pubsub_zmqTopicSender_destroy->pubsubInterceptorsHandler_destroy->celix_bundleContext_stopTracker

    Sorry, this example is wrong. psa_zmq_stop already stops all
    trackers.

    Replacing sync api with async version solve the first deadlock.
    Then comes a second ABBA deadlock:

    futex_abstimed_wait_cancelable 0x00007fd5efc6e7b1
    __pthread_cond_wait_common 0x00007fd5efc6e7b1
    __pthread_cond_timedwait 0x00007fd5efc6e7b1
    celixThreadCondition_timedwaitRelative celix_threads.c:196
    celix_framework_waitForGenericEvent framework.c:2729
    celix_bundleContext_useServiceWithOptions bundle_context.c:1228
    pubsub_serializerHandler_createForMarkerService 
pubsub_serializer_handler.c:169
    pubsub_zmqAdmin_getSerializationHandler pubsub_zmq_admin.c:759
    pubsub_zmqAdmin_setupTopicSender pubsub_zmq_admin.c:390
    pstm_setupTopicSenderCallback pubsub_topology_manager.c:903
    celix_serviceTracker_useHighestRankingService service_tracker.c:760
    celix_bundleContext_useServiceWithOptions_2_UseServiceTracker 
bundle_context.c:1187
    celix_bundleContext_useServiceWithOptions bundle_context.c:1222
    celix_bundleContext_useServiceWithId bundle_context.c:1132
    pstm_setupTopicSenders pubsub_topology_manager.c:977
    pstm_psaHandlingThread pubsub_topology_manager.c:1135
    start_thread 0x00007fd5efc67609
    clone 0x00007fd5ef840293

    futex_wait_cancelable 0x00007fd5efc6e376
    __pthread_cond_wait_common 0x00007fd5efc6e376
    __pthread_cond_wait 0x00007fd5efc6e376
    celixThreadCondition_wait celix_threads.c:173
    tracked_waitAndDestroy service_tracker.c:82
    serviceTracker_untrackTracked service_tracker.c:576
    serviceTracker_untrack service_tracker.c:541
    serviceTracker_serviceChanged service_tracker.c:354
    celix_serviceRegistry_serviceChanged service_registry.c:1131
    serviceRegistry_unregisterService service_registry.c:272
    serviceRegistration_unregister service_registration.c:147
    celix_serviceRegistry_unregisterService service_registry.c:1212
    fw_handleEventRequest framework.c:1545
    fw_handleEvents framework.c:1595
    fw_eventDispatcher framework.c:1621
    start_thread 0x00007fd5efc67609
    clone 0x00007fd5ef840293

    The first task, which is using "pubsub_admin", prevents the celix
    event loop (now waiting for pubsub_admin users vanishing)
    processing other event, which in turn prevents the first task
    making any progress.

    I didn't expect the provided service do such nontrivial thing as
    calling celix_bundleContext_useServiceWithOptions.

    Calling service on caller's thread should work if the service
    itself does not mess with the event loop. But obviously this
    assumption does not hold.
    I'd better leave celix_bundleContext_useServiceWithOptions
    untouched. So I'll revert the last commit.
    The PR should be considered complete and ready for review.

Yeah I recognise this. Especially the whole PubSubAdmin andRemoteServiceAdmin approach tends to trigger deadlock quite quickly.Maybe this can be solved by redesign and not allowing the "useservices" call in these situations (responsibility of the bundleprogrammers).


WDYT?

The usage of celix_bundleContext_useServiceWithOptions byPubSubAdmin/RemoteServiceAdmin does fall into framework administrativeusage, and they are not slow blocking operations.

So I'm OK with them either.

To avoid breaking existing users, I choose to add additional behaviorwithout affecting the default.

--Peng Zheng


--
Peng Zheng

Re: ABBA deadlock

Reply via email to