On 2/25/22 17:54, Pepijn Noltes wrote:
Hi Peng,
To start off, I see that some emails are directly mailed to me. I
think for future mails these discussions can go to the celix dev
mailing list.
Oops, I kept ctrl+shift+Ring without noticing this ;p
Maybe it is good to explain some history about the Celix event thread
and async api.
Celix started with implementing the OSGi api from the specification as
directly as possible.
But this gave some issues:
1) No real protected of services going away and their memory being
deallocated or even their bundle libraries being "dlclosed".
This led to the introduction of the "use services" approach instead of
the Java BundleContext.getServiceReference and the hiding of service
references for the user.
2) C (pthread) mutexes are not the same as java synchronized
Essentially (if I am correct) Java synchronized works as a recursive
mutex. For Celix we did not want to promote the use of recursive
mutexes (the whole recursive mutexes are evil discussion).
But using normal mutexes in combination with service trackers which
triggers service usage in their "add" callbacks quickly lead to deadlocks.
This is especially true for something like the remote services. e.g.
in the same thread: remote service found (RSA mutex) -> tracker add
callback called for component -> component registered remote services
-> new remote service added in RSA tracker callback (RSA mutex) ->
deadlock.
Note that for Java OSGi it is rarer to have issues with this, but they
still occur (murphy's law). Mostly as a current modification
exception, because synchronized (recursive mutexes) do not protect the
expected invariants.
This led to the introduction of handling service registration,
creating service trackers and using services one Celix event thread
and an async api next to the sync api.
In my experience this solved quite some issues we had, but is also not
perfect.
Especially because a combination of sync/async api is now possible,
this can lead to "solving" a deadlock issue in one spot only to have
it pop up again a bit further.
And to be honest I do not know how we can tackle this in a better way.
The sync api is sometimes needed, for example if a component loses 1
of its required dependencies it needs to stop (remove its own
services) before the required dependency service can finish removing,
this requires a sync unregister service call. But to prevent deadlock
issues for the user, ideally all service registration, service
unregistration events are done async.
I do have 1 though: Create a Celix event thread pool (instead of a
single thread), which can spin up new threads if needed to prevent
events being blocked by other events.
But I am not sure if this really solves the issue and whether it is
feasible to detect if an event needs to be handled on another thread.
It also has an impact on the footprint of Celix.
In short I think the whole 1 Celix event thread, combination of
sync/async should be kept as-is and released for now. But I think it
would be wise to rethink the future of Celix and maybe try thinking
about a Celix 3 release with a major update/refactoring of how MT
issues are handled.
The event loop design is elegant, and solves a lot of problems.
What I'm really against is the behavior of
celix_bundleContext_useServiceWithOptions.
IMHO, the event loop is a very precious resource, which should only run
non-blocking fast administrative tasks, i.e. the event loop should be
for framework usage only.
I also notices bundle install/uninstall is already offload to separate
thread, which is really nice. This is quite similar to libuv's treatment
of file operations (throwing them into thread pool).
For high level application development, services often do things like
performing network/file IO, querying database, CPU-intensive tasks, to
name a few.
These rarely have to mess with the event loop. Calling them on the event
loop should be avoided at all costs IMO.
However, as an easy-to-use API, the current default behavior of
celix_bundleContext_useServiceWithOptions does encourage its users to do
such things.
On Wed, Feb 23, 2022 at 2:00 PM Peng Zheng <zheng_p...@qq.com> wrote:
On 2/23/22 19:13, Peng Zheng wrote:
On 2/23/22 18:41, Peng Zheng wrote:
psa->topicSenders.mutex is another example of external tracker
lock, which is locked by pubsub_zmqAdmin_removeProtocolSvc.
Noting that bundle is always stopped by a non-event-loop
thread, the following call chain will also lead to dead lock by
the observed pattern.
psa_zmq_stop->pubsub_zmqAdmin_destroy(holing
psa->topicSenders.mutex)->pubsub_zmqTopicSender_destroy->pubsubInterceptorsHandler_destroy->celix_bundleContext_stopTracker
Sorry, this example is wrong. psa_zmq_stop already stops all
trackers.
Replacing sync api with async version solve the first deadlock.
Then comes a second ABBA deadlock:
futex_abstimed_wait_cancelable 0x00007fd5efc6e7b1
__pthread_cond_wait_common 0x00007fd5efc6e7b1
__pthread_cond_timedwait 0x00007fd5efc6e7b1
celixThreadCondition_timedwaitRelative celix_threads.c:196
celix_framework_waitForGenericEvent framework.c:2729
celix_bundleContext_useServiceWithOptions bundle_context.c:1228
pubsub_serializerHandler_createForMarkerService
pubsub_serializer_handler.c:169
pubsub_zmqAdmin_getSerializationHandler pubsub_zmq_admin.c:759
pubsub_zmqAdmin_setupTopicSender pubsub_zmq_admin.c:390
pstm_setupTopicSenderCallback pubsub_topology_manager.c:903
celix_serviceTracker_useHighestRankingService service_tracker.c:760
celix_bundleContext_useServiceWithOptions_2_UseServiceTracker
bundle_context.c:1187
celix_bundleContext_useServiceWithOptions bundle_context.c:1222
celix_bundleContext_useServiceWithId bundle_context.c:1132
pstm_setupTopicSenders pubsub_topology_manager.c:977
pstm_psaHandlingThread pubsub_topology_manager.c:1135
start_thread 0x00007fd5efc67609
clone 0x00007fd5ef840293
futex_wait_cancelable 0x00007fd5efc6e376
__pthread_cond_wait_common 0x00007fd5efc6e376
__pthread_cond_wait 0x00007fd5efc6e376
celixThreadCondition_wait celix_threads.c:173
tracked_waitAndDestroy service_tracker.c:82
serviceTracker_untrackTracked service_tracker.c:576
serviceTracker_untrack service_tracker.c:541
serviceTracker_serviceChanged service_tracker.c:354
celix_serviceRegistry_serviceChanged service_registry.c:1131
serviceRegistry_unregisterService service_registry.c:272
serviceRegistration_unregister service_registration.c:147
celix_serviceRegistry_unregisterService service_registry.c:1212
fw_handleEventRequest framework.c:1545
fw_handleEvents framework.c:1595
fw_eventDispatcher framework.c:1621
start_thread 0x00007fd5efc67609
clone 0x00007fd5ef840293
The first task, which is using "pubsub_admin", prevents the celix
event loop (now waiting for pubsub_admin users vanishing)
processing other event, which in turn prevents the first task
making any progress.
I didn't expect the provided service do such nontrivial thing as
calling celix_bundleContext_useServiceWithOptions.
Calling service on caller's thread should work if the service
itself does not mess with the event loop. But obviously this
assumption does not hold.
I'd better leave celix_bundleContext_useServiceWithOptions
untouched. So I'll revert the last commit.
The PR should be considered complete and ready for review.
Yeah I recognise this. Especially the whole PubSubAdmin and
RemoteServiceAdmin approach tends to trigger deadlock quite quickly.
Maybe this can be solved by redesign and not allowing the "use
services" call in these situations (responsibility of the bundle
programmers).
WDYT?
The usage of celix_bundleContext_useServiceWithOptions by
PubSubAdmin/RemoteServiceAdmin does fall into framework administrative
usage, and they are not slow blocking operations.
So I'm OK with them either.
To avoid breaking existing users, I choose to add additional behavior
without affecting the default.
--
Peng Zheng
--
Peng Zheng