I see this as a combination of two problems: 1) We're spamming the end user with "whatever's in the status-history collection" rather than presenting a digest tuned for their needs. 2) Important messages get thrown away way too early, because we don't know which messages are important.
I think the pocket/transient/expiry solutions boil down to "let's make
the charmer decide what's important", and I don't think that will help.
The charmer is only sending those messages *because she believes they're
important*; even if we had "perfect" trimming heuristics for the end
user, we do the *charmer* a disservice by leaving her no record of what
her charm actually did.

And, more generally: *every* message we throw away makes it harder to
correctly analyse any older message. This applies within a given
entity's domain, but also across entities: if you're trying to
understand the interactions between 2 units, but one of those units is
generating many more messages, you'll have 200 messages to inspect; but
the 100 for the faster unit will only span the same period as (say) the
last 30 for the slower one, leaving 70 slow-unit messages that can't be
correlated with the other unit's actions. At best, those messages are
redundant; at worst, they're actively misleading.

So: I do not believe that any approach that can be summed up as "let's
throw away *more* messages" is going to help either. We need to fix (2)
so that we have raw status data that extends reasonably far back in
time; and then we need to fix (1) so that we usefully precis that data
for the user (...and! leave a path that makes the raw data observable,
for the cases where our heuristics are unhelpful).

Cheers
William

PS re: UX of asking for N entries... I can see end-user stories for
timespans, and for "the last N *significant* changes". What's the
scenario where a user wants to see exactly 50 message atoms?

On Thu, Mar 17, 2016 at 6:30 AM, John Meinel <j...@arbash-meinel.com> wrote:

> On Thu, Mar 17, 2016 at 8:41 AM, Ian Booth <ian.bo...@canonical.com> wrote:
>
>> Machines, services and units all now support recording status
>> history. Two issues have come up:
>>
>> 1. https://bugs.launchpad.net/juju-core/+bug/1530840
>>
>> For units, especially in steady state, status history is spammed
>> with update-status hook invocations which can obscure the hooks we
>> really care about.
>>
>> 2. https://bugs.launchpad.net/juju-core/+bug/1557918
>>
>> We now have the concept of recording a machine provisioning status.
>> This is great because it gives observability to what is happening as
>> a node is being allocated in the cloud. With LXD, this feature has
>> been used to give visibility to progress of the image downloads
>> (finally, yay). But what happens is that the machine status history
>> gets filled with lots of "Downloading x%" type messages.
>>
>> We have a pruner which caps the history to 100 entries per entity.
>> But we need a way to deal with the spam, and what is displayed when
>> the user asks for juju status-history.
>>
>> Options to solve bug 1
>>
>> A.
>> Filter out duplicate status entries when presenting to the user. eg
>> say "update-status (x43)". This still allows the circular buffer for
>> that entity to fill with "spam" though. We could make the circular
>> buffer size much larger. But there's still the issue of UX where a
>> user asks for the X most recent entries. What do we give them? The X
>> most recent de-duped entries?
>>
>> B.
>> If we go to record history and the most recent entry is the same as
>> what we are about to record, just update the timestamp. For update
>> status, my view is we don't really care how many times the hook was
>> run, but rather when was the last time it ran.
>
> The problem is that it isn't the same as the "last" message. Going to
> the original paste:
>
> TIME                   TYPE   STATUS     MESSAGE
> 26 Dec 2015 13:51:59Z  agent  idle
> 26 Dec 2015 13:56:57Z  agent  executing  running update-status hook
> 26 Dec 2015 13:56:59Z  agent  idle
> 26 Dec 2015 14:01:57Z  agent  executing  running update-status hook
> 26 Dec 2015 14:01:59Z  agent  idle
>
> Which means there is a "running update-status" *and* an "idle"
> message. So we can't just say "is the last message == this message".
> It would have to look deeper in history, and how deep should we be
> looking? What happens if a given charm does one more "status-set"
> during its update-status hook to set the status of the unit to "still
> happy"? Then we would have 3. (agent executing, unit happy, agent
> idle)
>
>> Options to solve bug 2
>>
>> A.
>> Allow a flag when setting status to say "this status value is
>> transient" and so it is recorded in status but not logged in
>> history.
>>
>> B.
>> Do not record machine provisioning status in history. It could be
>> argued this info is more or less transient and once the machine
>> comes up, we don't care so much about it anymore. It was introduced
>> to give observability to machine allocation.
>
> Isn't this the same as (A)? We need a way to say that *this* message
> should be shown but not saved forever. Or are you saying that until a
> machine comes up as "running" we shouldn't save any of the messages?
> I don't think we want that, because when provisioning fails you want
> to know what steps were achieved.
>
>> Any other options?
>> Opinions on preferred solutions?
>>
>> I really want to get this fixed before Juju 2.0
>
> We could do a "log level" rather than just "transient or not", and
> that would decide what would get displayed by default. (so you can
> ask for 'update-status' messages but they wouldn't be shown by
> default). The problem is that we want to keep status messages pruned
> at a sane level and with 2 updates for every 'update-status' call
> history of 100 is only 100/2*5/60 ~ 4 hours of history. If something
> interesting happened yesterday, you're SOL.
>
> What if we added an "interesting lifetime" to status messages. So the
> status-set could indicate how long the message would be preserved?
> "update-status" and "idle" could be flagged as preserved for only 1
> hour, and "Downloading %" could be flagged at say 5 minutes. Too
> complicated? It certainly complicates the pruner (not terribly, when
> we record them we just record an expire time that is indexed and the
> pruner just removes everything that is over its expiry time.)
>
> Alternatively we could have some sort of UUID for messages to
> indicate that "this message is actually similar to other messages
> with this UUID" and we prune them based on that. (UUIDs get flagged
> with a different number of messages to keep than the global 100 for
> otherwise untagged messages.)
>
> "Transient" is the easiest to understand, but doesn't really solve
> bug #1.
>
> If we think of the "UUID" version as something like a named "status
> pocket" maybe it's actually tasteful. You'd have the "global" pocket
> that has our default 100 most-recent-messages, and then you can
> create any new pocket that has a default of say 10 messages. So you
> would be doing:
>
>   status-set --pocket hook-execution update-status
>   status-set --pocket download Downloading X% done
>
> That also lets charms do nice things at hook execution time when
> they're downloading large resources, without spamming the
> status-history log.
>
> It does complicate the model....
>
> John
> =:->
>
> --
> Juju-dev mailing list
> Juju-dev@lists.ubuntu.com
> Modify settings or unsubscribe at:
> https://lists.ubuntu.com/mailman/listinfo/juju-dev
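
To put rough numbers on the two retention points in this thread (John's
"100/2*5/60" estimate and the cross-unit correlation gap), here is a
small Python sketch. It is only an illustration, not Juju code: the
function names are hypothetical, the 5-minute update-status interval
and two entries per run come from John's estimate, and the 10-minute
and 3-minute message cadences are invented so the result lines up with
the "(say) 30 of 100" figure.

```python
from datetime import datetime, timedelta

CAP = 100  # per-entity history cap mentioned in the thread


def history_span(cap, entries_per_cycle, cycle):
    """Time covered by a capped history when every cycle adds a fixed
    number of entries (hypothetical helper)."""
    return (cap / entries_per_cycle) * cycle


def kept(timestamps, cap=CAP):
    """The most recent `cap` timestamps, oldest first, as a per-entity
    pruner would leave them."""
    return sorted(timestamps)[-cap:]


def uncorrelatable(slow, fast, cap=CAP):
    """Count retained slow-unit entries that are older than anything
    still retained for the fast unit."""
    slow_kept = kept(slow, cap)
    oldest_fast = kept(fast, cap)[0]
    return sum(1 for t in slow_kept if t < oldest_fast)


# Two entries per update-status run, one run every 5 minutes.
span = history_span(CAP, 2, timedelta(minutes=5))
print(span)  # 4:10:00

# Assumed cadences: slow unit every 10 minutes, fast unit every 3.
end = datetime(2016, 3, 17, 12, 0, 0)
slow = [end - timedelta(minutes=10 * i) for i in range(200)]
fast = [end - timedelta(minutes=3 * i) for i in range(500)]
print(uncorrelatable(slow, fast))  # 70
```

With those assumed cadences, 70 of the slow unit's 100 retained entries
predate everything kept for the fast unit, so only the newest 30 can be
lined up against the other unit's actions.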
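
One way to picture "precis the data for the user while leaving the raw
data observable" is a read-only digest computed at display time. The
Python sketch below is an assumption-laden illustration, not the Juju
implementation: `digest` and `parse` are hypothetical names, and the
rows are the five entries from the paste quoted in this thread.
Grouping by (type, status, message) sidesteps the "last message ==
this message" problem, because alternating "executing" and "idle"
entries still land in their own groups.

```python
from datetime import datetime

# Raw history rows, copied from the paste quoted in this thread.
RAW = [
    ("26 Dec 2015 13:51:59Z", "agent", "idle", ""),
    ("26 Dec 2015 13:56:57Z", "agent", "executing", "running update-status hook"),
    ("26 Dec 2015 13:56:59Z", "agent", "idle", ""),
    ("26 Dec 2015 14:01:57Z", "agent", "executing", "running update-status hook"),
    ("26 Dec 2015 14:01:59Z", "agent", "idle", ""),
]


def parse(ts):
    """Parse the timestamp format used in the paste."""
    return datetime.strptime(ts, "%d %b %Y %H:%M:%SZ")


def digest(rows):
    """Summarise rows sharing (type, status, message) as count plus
    first/last time seen. The input rows are not modified, so the full
    history stays available for anyone who needs it."""
    groups = {}
    for ts, kind, status, message in rows:
        key = (kind, status, message)
        when = parse(ts)
        g = groups.setdefault(key, {"count": 0, "first": when, "last": when})
        g["count"] += 1
        g["first"] = min(g["first"], when)
        g["last"] = max(g["last"], when)
    # Most recently seen group first.
    return sorted(groups.items(), key=lambda kv: kv[1]["last"], reverse=True)


for (kind, status, message), g in digest(RAW):
    print(kind, status, message or "-", "x%d" % g["count"], "last", g["last"])
```

This kind of summary deliberately drops the interleaving between
groups, which is exactly why a path back to the raw entries still
matters when the heuristic is unhelpful.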