On Fri, 27 Oct 2017 at 13:06:32 -0400, Nicholas D Steeves wrote:
> Would using memleax help? I'm 90% done packaging it, in case it looks
> useful. Packaged because valgrind can't attach to an existing process.

From the other information you've given I can tell you with reasonable
certainty that (at least a large part of) the "leaked" memory is not
leaked, but is really still in use - the dbus-daemon is genuinely
trying to process millions of messages to kded5. I can't say why, though.

> Is it possible for dbus-daemon to signal to the calling application
> that a queue is full?

In principle it might have been if it was defined 10 years ago, but
there is no protocol defined for communicating that fact, and it isn't
clear what an application would do about it anyway. Popping up messages
to annoy the user is generally frowned upon if there is nothing that
most users can do about it...

The dbus 1.11.x development branch (from which a 1.12.0 stable version
will be released soon) does log this information to syslog or the
systemd Journal, but 1.10.x didn't.

The system bus (which has finite limits) responds to excessive messages
in an incoming queue by slowing down reading from that sender, and
responds to excessive messages in an outgoing queue by dropping
messages on the floor.

It might be a good idea for the session bus to have a rather smaller
limit on its incoming queues than the 1 GB limit that it has now,
perhaps the 127 MiB that the system bus uses (which tbh could probably
also be reduced) - then floods of messages would get slowed down, but
eventually be processed. A queue is allowed to exceed that limit by up
to one message, so maximally-sized messages could still be processed
anyway.

> :1.7.stats: variant uint32 1000000016

That is indeed slightly more than the arbitrary 1 GB limit that the
session bus has. The limit is conceptually only there to stop byte
counts overflowing a signed 32-bit integer, because session bus users
are trusted to not carry out deliberate denial-of-service attacks on
each other; but I think there would probably be value in turning down
the incoming message limit a bit to prevent accidental
denial-of-service.

That would throttle reading from very spammy peers, but wouldn't help
when dealing with peers that read messages slower than they should.
Turning down the *outgoing* message limit is dangerous (it results in
messages being discarded unrecoverably) so we have to be very cautious
about reducing it.

> > dbus-send --session --dest=org.freedesktop.DBus --type=method_call \
> > --print-reply /org/freedesktop/DBus \
> > org.freedesktop.DBus.Debug.Stats.GetStats
>
> I tried this just now and dbus-send seems to have errored with exit
> status 130:
> Must use org.mydomain.Interface.Method notation, no dot in
> "/org/freedesktop/DBus"

You have an extra argument somewhere, possibly a space after one of the
backslashes.

> Plasma's NM frontend wouldn't reconnect to the
> network, and I before I realised it I had killed plasmashell ; sleep
> 3; kstart plasmashell. The dbus-daemon and kded5 daemon seem to be
> unaffected, as expected, but maybe this causes problems with dbus?

Terminating random processes shouldn't be a problem in general (other
than to those processes and anything that was relying on them), but if
some component has a bug and has responded to the unexpected
disappearance by going into a busy-loop, it's possible that the
busy-loop could result in it sending a flood of messages. Or the
inability to reconnect to the network might itself have been a symptom
of a flood of messages, rather than part of the cause.

However, I will say this: if you have a serious problem (like a key
desktop component getting stuck), and you respond to it by killing
individual components rather than by terminating the whole session
(e.g. logging out and back in), then you are certainly in uncharted
territory. This is the sort of thing that in principle ought to work,
but it is never going to be systematically tested, so it is not at all
surprising if it has or exposes bugs.

> IIRC, a couple of days ago kded5 consumed 100%
> of one core for quite some time.
> Maybe that's when it was making all
> of these dbus connections?

Lots of messages rather than lots of connections, I think, but most
likely yes.

> I'm honestly not sure if kded5 is truly at
> fault, of if it became overwhelmed when an application spammed [one
> of] its dbus interfaces.

It could be either one.

> Can we keep these two bugs unmerged but associate them somehow?

As far as I know there is no machine-readable way to do this. Just
mention each one in a mail to the other.

Looking at the dbus-daemon stats:

> string "ListMemPoolUsedBytes"
> variant uint32 947456112
> string "ListMemPoolCachedBytes"
> variant uint32 3086736
> string "ListMemPoolAllocatedBytes"
> variant uint32 1610612352

The dbus-daemon keeps some memory pools of pre-allocated linked-list
links and messages ready for reuse, in your case 1.6G of them. If your
dbus-daemon is really processing millions of messages then this seems
plausible to be real use rather than a leak: lots of places in
dbus-daemon have to pre-allocate linked-list links in a very
pessimistic way so that they can be used in error-recovery without
worrying about whether memory can be allocated for them, because
dbus-daemon has a policy of assuming that malloc() can legitimately
return NULL.

I have often wondered whether these caches are more harmful than
helpful in the general case, but the worst-case answer is clearly: yes,
they are more harmful than helpful (because allocated memory doesn't
get freed). I'm not sure whether that peak memory consumption would
ever get returned to the OS after the heap had grown in any case,
though. Last time I experimented with turning off the message cache the
answer seemed to be that it wasn't.

> string "ActiveConnections"
> variant uint32 76
> string "IncompleteConnections"
> variant uint32 0
> string "BusNames"
> variant uint32 171
> string "PeakBusNames"
> variant uint32 191
> string "PeakBusNamesPerConnection"
> variant uint32 7

Those look normal.

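For a sense of scale, the ListMemPool figures quoted above can be
converted to more readable units. This is only a rough Python sketch
that does unit conversion on the three numbers from the stats; it is
not part of dbus, the helper name to_mib is made up for illustration,
and it assumes nothing about what each pool category means internally.

```python
# Figures copied from the ListMemPool* stats quoted above.
used = 947456112
cached = 3086736
allocated = 1610612352

MIB = 1024 * 1024


def to_mib(n_bytes):
    """Convert a byte count to MiB (hypothetical helper, not a dbus API)."""
    return n_bytes / MIB


print(f"allocated: {to_mib(allocated):.0f} MiB")
print(f"used:      {to_mib(used):.0f} MiB ({used / allocated:.0%} of allocated)")
print(f"cached:    {to_mib(cached):.1f} MiB")
```

The allocated figure works out to almost exactly 1.5 GiB, which is the
1.6G (in decimal units) mentioned above.
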
> string "MatchRules"
> variant uint32 1168
> string "PeakMatchRules"
> variant uint32 1273
> string "PeakMatchRulesPerConnection"
> variant uint32 132

132 match rules per connection seems a bit on the high side, but I
think QtDBus adds really specific match rules, so this is probably
fine.

Looking at connection :1.7:

org.kde.StatusNotifierWatcher, org.kde.plasmanetworkmanagement,
org.kde.keyboard, org.kde.kded5, org.kde.kcookiejar5, org.kde.apperd
are really all the same connection (you can tell by how they all share
UniqueName = :1.7) and they are the worst problem here. Either kded5
has not been reading its dbus messages from its socket for a long time,
or something has sent it far too many messages, or both.

While I recognise the engineering tradeoffs that lead to bundling
several daemons into one process, this might be more robust (or at
least more debuggable) if it was a group of smaller services.

> string "UniqueName"
> variant string ":1.7"
> string "OutgoingMessages"
> variant uint32 13157884
> string "OutgoingBytes"
> variant uint32 1000000016
> string "PeakOutgoingBytes"
> variant uint32 1000000016

That's a lot. The dbus-daemon *currently* has 13 million messages
totalling 1GB queued to be sent to :1.7, and that's the most there have
ever been at a time. I wonder whether kded has stopped reading messages
(causing them to be stuck in the dbus-daemon until it disconnects) or
whether there is just a massively large backlog that the dbus-daemon is
still working through?

> string "UniqueName"
> variant string ":1.33"

Is there a reason you captured stats for :1.33 (korgac)? It seems to be
a really light D-Bus user - hundreds of bytes, ever, so certainly
non-problematic.

> string "UniqueName"
> variant string ":1.334"
> string "PeakOutgoingBytes"
> variant uint32 7842592

:1.334 is org.kde.kwrite-17177. A peak of 7M waiting to be sent to
kwrite is higher than I would usually expect, but certainly not
terrible.

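As a rough cross-check of the :1.7 numbers quoted above, dividing the
queued bytes by the queued message count gives the average message
size. The sketch below is plain Python arithmetic, not a dbus tool, and
it assumes the "1 GB" session bus limit means exactly 10**9 bytes,
which is an assumption rather than something taken from the stats.

```python
# Figures copied from the :1.7 stats quoted above.
outgoing_messages = 13157884
outgoing_bytes = 1000000016

# ASSUMPTION: the "arbitrary 1 GB" session bus limit is exactly 10**9 bytes.
assumed_limit_bytes = 10**9

# Mean size of a message currently queued for :1.7.
average_size = outgoing_bytes / outgoing_messages

# How far the queue is over the assumed limit.
overshoot = outgoing_bytes - assumed_limit_bytes

print(f"average queued message: {average_size:.1f} bytes")
print(f"over the assumed limit by: {overshoot} bytes")
```

That comes out at roughly 76 bytes per message, so this looks like a
very large number of small messages rather than a few large ones, and
the queue is 16 bytes over the assumed limit.
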
> string "UniqueName"
> variant string ":1.115"
> string "PeakOutgoingBytes"
> variant uint32 7763452

This is org.freedesktop.Notifications aka org.kde.plasmashell aka
:1.115. Again, a peak of 7M waiting to be sent to it is a little high
but certainly not excessive.

    smcv