On 2/25/22 17:54, Pepijn Noltes wrote:
Hi Peng,

To start off, I see that some emails are directly mailed to me. I think for future mails these discussions can go to the celix dev mailing list.

Oops, I kept ctrl+shift+Ring without noticing this ;p


Maybe it is good to explain some history about the Celix event thread and async api.

Celix started with implementing the OSGi api from the specification as directly as possible.
But this gave some issues:
1) No real protected of services going away and their memory being deallocated or even their bundle libraries being "dlclosed". This led to the introduction of the "use services" approach instead of the Java BundleContext.getServiceReference and the hiding of service references for the user.


2) C (pthread) mutexes are not the same as java synchronized
Essentially (if I am correct) Java synchronized works as a recursive mutex. For Celix we did not want to promote the use of recursive mutexes (the whole recursive mutexes are evil discussion). But using normal mutexes in combination with service trackers which triggers service usage in their "add" callbacks quickly lead to deadlocks. This is especially true for something like the remote services. e.g. in the same thread:  remote service found (RSA mutex) -> tracker add callback called for component -> component registered remote services -> new remote service added in RSA tracker callback (RSA mutex) -> deadlock.

Note that for Java OSGi it is rarer to have issues with this, but they still occur (murphy's law). Mostly as a current modification exception, because synchronized (recursive mutexes) do not protect the expected invariants.

This led to the introduction of handling service registration, creating service trackers and using services one Celix event thread and an async api next to the sync api. In my experience this solved quite some issues we had, but is also not perfect. Especially because a combination of sync/async api is now possible, this can lead to "solving" a deadlock issue in one spot only to have it pop up again a bit further.

And to be honest I do not know how we can tackle this in a better way.
The sync api is sometimes needed, for example if a component loses 1 of its required dependencies it needs to stop (remove its own services) before the required dependency service can finish removing, this requires a sync unregister service call. But to prevent deadlock issues for the user, ideally all service registration, service unregistration events are done async.

I do have 1 though: Create a Celix event thread pool (instead of a single thread), which can spin up new threads if needed to prevent events being blocked by other events. But I am not sure if this really solves the issue and whether it is feasible to detect if an event needs to be handled on another thread.  It also has an impact on the footprint of Celix.

In short I think the whole 1 Celix event thread, combination of sync/async should be kept as-is and released for now. But I think it would be wise to rethink the future of Celix and maybe try thinking about a Celix 3 release with a major update/refactoring of how MT issues are handled.

The event loop design is elegant, and solves a lot of problems.
What I'm really against is the behavior of celix_bundleContext_useServiceWithOptions. IMHO, the event loop is a very precious resource, which should only run non-blocking fast administrative tasks, i.e. the event loop should be for framework usage only. I also notices bundle install/uninstall is already offload to separate thread, which is really nice. This is quite similar to libuv's treatment of file operations (throwing them into thread pool). For high level application development, services often do things like performing network/file IO, querying database, CPU-intensive tasks, to name a few. These rarely have to mess with the event loop. Calling them on the event loop should be avoided at all costs IMO. However, as an easy-to-use API, the current default behavior of celix_bundleContext_useServiceWithOptions does encourage its users to do such things.



On Wed, Feb 23, 2022 at 2:00 PM Peng Zheng <zheng_p...@qq.com> wrote:

    On 2/23/22 19:13, Peng Zheng wrote:
    On 2/23/22 18:41, Peng Zheng wrote:

    psa->topicSenders.mutex is another example of external tracker
    lock, which is locked by pubsub_zmqAdmin_removeProtocolSvc.
    Noting that bundle is always stopped by a non-event-loop
    thread, the following call chain will also lead to dead lock by
    the observed pattern.
    psa_zmq_stop->pubsub_zmqAdmin_destroy(holing
    
psa->topicSenders.mutex)->pubsub_zmqTopicSender_destroy->pubsubInterceptorsHandler_destroy->celix_bundleContext_stopTracker

    Sorry, this example is wrong. psa_zmq_stop already stops all
    trackers.

    Replacing sync api with async version solve the first deadlock.
    Then comes a second ABBA deadlock:

    futex_abstimed_wait_cancelable 0x00007fd5efc6e7b1
    __pthread_cond_wait_common 0x00007fd5efc6e7b1
    __pthread_cond_timedwait 0x00007fd5efc6e7b1
    celixThreadCondition_timedwaitRelative celix_threads.c:196
    celix_framework_waitForGenericEvent framework.c:2729
    celix_bundleContext_useServiceWithOptions bundle_context.c:1228
    pubsub_serializerHandler_createForMarkerService 
pubsub_serializer_handler.c:169
    pubsub_zmqAdmin_getSerializationHandler pubsub_zmq_admin.c:759
    pubsub_zmqAdmin_setupTopicSender pubsub_zmq_admin.c:390
    pstm_setupTopicSenderCallback pubsub_topology_manager.c:903
    celix_serviceTracker_useHighestRankingService service_tracker.c:760
    celix_bundleContext_useServiceWithOptions_2_UseServiceTracker 
bundle_context.c:1187
    celix_bundleContext_useServiceWithOptions bundle_context.c:1222
    celix_bundleContext_useServiceWithId bundle_context.c:1132
    pstm_setupTopicSenders pubsub_topology_manager.c:977
    pstm_psaHandlingThread pubsub_topology_manager.c:1135
    start_thread 0x00007fd5efc67609
    clone 0x00007fd5ef840293

    futex_wait_cancelable 0x00007fd5efc6e376
    __pthread_cond_wait_common 0x00007fd5efc6e376
    __pthread_cond_wait 0x00007fd5efc6e376
    celixThreadCondition_wait celix_threads.c:173
    tracked_waitAndDestroy service_tracker.c:82
    serviceTracker_untrackTracked service_tracker.c:576
    serviceTracker_untrack service_tracker.c:541
    serviceTracker_serviceChanged service_tracker.c:354
    celix_serviceRegistry_serviceChanged service_registry.c:1131
    serviceRegistry_unregisterService service_registry.c:272
    serviceRegistration_unregister service_registration.c:147
    celix_serviceRegistry_unregisterService service_registry.c:1212
    fw_handleEventRequest framework.c:1545
    fw_handleEvents framework.c:1595
    fw_eventDispatcher framework.c:1621
    start_thread 0x00007fd5efc67609
    clone 0x00007fd5ef840293

    The first task, which is using "pubsub_admin", prevents the celix
    event loop (now waiting for pubsub_admin users vanishing)
    processing other event, which in turn prevents the first task
    making any progress.

    I didn't expect the provided service do such nontrivial thing as
    calling celix_bundleContext_useServiceWithOptions.

    Calling service on caller's thread should work if the service
    itself does not mess with the event loop. But obviously this
    assumption does not hold.
    I'd better leave celix_bundleContext_useServiceWithOptions
    untouched. So I'll revert the last commit.
    The PR should be considered complete and ready for review.


Yeah I recognise this. Especially the whole PubSubAdmin and RemoteServiceAdmin approach tends to trigger deadlock quite quickly. Maybe this can be solved by redesign and not allowing the "use services" call in these situations (responsibility of the bundle programmers).

WDYT?
The usage of celix_bundleContext_useServiceWithOptions by PubSubAdmin/RemoteServiceAdmin does fall into framework administrative usage, and they are not slow blocking operations.
So I'm OK with them either.

To avoid breaking existing users, I choose to add additional behavior without affecting the default.

-- Peng Zheng


--
Peng Zheng

Reply via email to