I see this as a combination of two problems:

1) We're spamming the end user with "whatever's in the status-history
collection" rather than presenting a digest tuned for their needs.
2) Important messages get thrown away way too early, because we don't know
which messages are important.

I think the pocket/transient/expiry solutions boil down to "let's make the
charmer decide what's important", and I don't think that will help. The
charmer is only sending those messages *because she believes they're
important*; even if we had "perfect" trimming heuristics for the end user,
we do the *charmer* a disservice by leaving her no record of what her
charm actually did.

And, more generally: *every* message we throw away makes it harder to
correctly analyse any older message. This applies within a given entity's
domain, but also across entities: if you're trying to understand the
interactions between 2 units, but one of those units is generating many
more messages, you'll have 200 messages to inspect; but the 100 for the
faster unit will only span the same period as (say) the last 30 for the
slower one, leaving 70 slow-unit messages that can't be correlated with the
other unit's actions. At best, those messages are redundant; at worst,
they're actively misleading.
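
To make those numbers concrete, here is a minimal sketch in Python. It is
purely illustrative: the emission intervals (3s and 10s) are assumptions
picked to reproduce the 30/70 split, and none of these names exist in Juju.
Only the cap of 100 entries per entity comes from the thread.

```python
from collections import deque

CAP = 100  # per-entity history cap mentioned in the thread


def retained(interval, now, cap=CAP):
    """Timestamps still held for an entity emitting every `interval` seconds."""
    history = deque(maxlen=cap)  # oldest entries fall off once the cap is hit
    for t in range(0, now + 1, interval):
        history.append(t)
    return list(history)


def uncorrelatable(fast, slow):
    """Slow-unit entries older than the oldest surviving fast-unit entry."""
    horizon = fast[0]
    return [t for t in slow if t < horizon]


now = 3000
fast = retained(3, now)    # assumed: the fast unit emits every 3 seconds
slow = retained(10, now)   # assumed: the slow unit emits every 10 seconds
orphans = uncorrelatable(fast, slow)

print(len(fast) + len(slow), len(orphans))  # prints "200 70"
```

With those assumed rates, 70 of the slow unit's 100 retained entries predate
everything we still hold for the fast unit.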

So: I do not believe that any approach that can be summed up as "let's
throw away *more* messages" is going to help either. We need to fix (2) so
that we have raw status data that extends reasonably far back in time; and
then we need to fix (1) so that we usefully precis that data for the user
(...and! leave a path that makes the raw data observable, for the cases
where our heuristics are unhelpful).
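
As a rough illustration of the split I mean, here is a sketch in Python
that builds a digest *on top of* untouched raw data. The Entry type and the
digest function are hypothetical, not anything in Juju, and the grouping
rule (one row per distinct status/message pair) is just one possible
heuristic. The sample rows are the ones from John's paste, quoted below.

```python
from typing import NamedTuple


class Entry(NamedTuple):
    time: str
    status: str
    message: str


def digest(raw):
    """One row per distinct (status, message): first seen, last seen, count.

    The raw list is only read, never modified, so it stays available for
    anyone who needs the unsummarised history.
    """
    rows = {}
    for e in raw:
        key = (e.status, e.message)
        if key in rows:
            first, _, count = rows[key]
            rows[key] = (first, e.time, count + 1)
        else:
            rows[key] = (e.time, e.time, 1)
    return [key + value for key, value in rows.items()]


raw = [
    Entry("26 Dec 2015 13:51:59Z", "idle", ""),
    Entry("26 Dec 2015 13:56:57Z", "executing", "running update-status hook"),
    Entry("26 Dec 2015 13:56:59Z", "idle", ""),
    Entry("26 Dec 2015 14:01:57Z", "executing", "running update-status hook"),
    Entry("26 Dec 2015 14:01:59Z", "idle", ""),
]

for status, message, first, last, count in digest(raw):
    print(f"{status:10} {message:28} x{count}  last: {last}")
```

The five raw entries collapse to two digest rows (idle x3, executing x2),
and the raw list is still there for the cases where the digest hides
something that matters.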

Cheers
William

PS re: UX of asking for N entries... I can see end-user stories for
timespans, and for "the last N *significant* changes". What's the scenario
where a user wants to see exactly 50 message atoms?

On Thu, Mar 17, 2016 at 6:30 AM, John Meinel <j...@arbash-meinel.com> wrote:

>
>
> On Thu, Mar 17, 2016 at 8:41 AM, Ian Booth <ian.bo...@canonical.com>
> wrote:
>
>>
>> Machines, services and units all now support recording status history. Two
>> issues have come up:
>>
>> 1. https://bugs.launchpad.net/juju-core/+bug/1530840
>>
>> For units, especially in steady state, status history is spammed with
>> update-status hook invocations which can obscure the hooks we really care
>> about
>>
>> 2. https://bugs.launchpad.net/juju-core/+bug/1557918
>>
>> We now have the concept of recording a machine provisioning status. This
>> is great because it gives observability to what is happening as a node is
>> being allocated in the cloud. With LXD, this feature has been used to give
>> visibility to progress of the image downloads (finally, yay). But what
>> happens is that the machine status history gets filled with lots of
>> "Downloading x%" type messages.
>>
>> We have a pruner which caps the history to 100 entries per entity. But we
>> need a way to deal with the spam, and what is displayed when the user asks
>> for juju status-history.
>>
>> Options to solve bug 1
>>
>> A.
>> Filter out duplicate status entries when presenting to the user. eg say
>> "update-status (x43)". This still allows the circular buffer for that
>> entity to fill with "spam" though. We could make the circular buffer size
>> much larger. But there's still the issue of UX where a user asks for the X
>> most recent entries. What do we give them? The X most recent de-duped
>> entries?
>>
>> B.
>> If we go to record history and the most recent entry is the same as what
>> we are about to record, just update the timestamp. For update-status, my
>> view is we don't really care how many times the hook was run, but rather
>> when was the last time it ran.
>>
>
> The problem is that it isn't the same as the "last" message. Going to the
> original paste:
>
> TIME                    TYPE    STATUS          MESSAGE
> 26 Dec 2015 13:51:59Z   agent   idle
> 26 Dec 2015 13:56:57Z   agent   executing       running update-status hook
> 26 Dec 2015 13:56:59Z   agent   idle
> 26 Dec 2015 14:01:57Z   agent   executing       running update-status hook
> 26 Dec 2015 14:01:59Z   agent   idle
>
> Which means there is a "running update-status" *and* an "idle" message.
> So we can't just say "is the last message == this message". It would have
> to look deeper in history, and how deep should we be looking? What happens
> if a given charm does one more "status-set" during its update-status hook
> to set the status of the unit to "still happy"? Then we would have 3
> (agent executing, unit happy, agent idle).
>
>
>> Options to solve bug 2
>>
>> A.
>> Allow a flag when setting status to say "this status value is transient"
>> and so it is recorded in status but not logged in history.
>>
>> B.
>> Do not record machine provisioning status in history. It could be argued
>> this info is more or less transient and once the machine comes up, we don't
>> care so much about it anymore. It was introduced to give observability to
>> machine allocation.
>>
>
> Isn't this the same as (A)? We need a way to say that *this* message
> should be shown but not saved forever. Or are you saying that until a
> machine comes up as "running" we shouldn't save any of the messages? I
> don't think we want that, because when provisioning fails you want to know
> what steps were achieved.
>
>
>>
>> Any other options?
>> Opinions on preferred solutions?
>>
>> I really want to get this fixed before Juju 2.0
>>
>
> We could do a "log level" rather than just "transient or not", and that
> would decide what would get displayed by default (so you can ask for
> 'update-status' messages but they wouldn't be shown by default). The
> problem is that we want to keep status messages pruned at a sane level, and
> with 2 entries for every 'update-status' call (one call every 5 minutes in
> the paste above), a history of 100 is only (100/2)*5/60 ~ 4 hours of
> history. If something interesting happened yesterday, you're SOL.
>
> What if we added an "interesting lifetime" to status messages? So the
> status-set could indicate how long the message would be preserved.
> "update-status" and "idle" could be flagged as preserved for only 1 hour,
> and "downloading %" could be flagged at say 5 minutes. Too complicated? It
> certainly complicates the pruner (not terribly: when we record them we just
> record an expiry time that is indexed, and the pruner just removes
> everything that is past its expiry time).
>
> Alternatively we could have some sort of UUID for messages to indicate
> that "this message is actually similar to other messages with this UUID"
> and we prune them based on that. (UUIDs get flagged with a different number
> of messages to keep than the global 100 for otherwise untagged messages.)
>
> "Transient" is the easiest to understand, but doesn't really solve bug #1.
>
> If we think of the "UUID" version as something like a named "status
> pocket" maybe it's actually tasteful. You'd have the "global" pocket that
> has our default 100 most-recent-messages, and then you can create any new
> pocket that has a default of say 10 messages. So you would be doing:
>  status-set --pocket hook-execution update-status
>  status-set --pocket download Downloading X% done
>
> That also lets charms do nice things at hook execution time when they're
> downloading large resources, without spamming the status-history log.
>
> It does complicate the model....
>
> John
> =:->
>
>
>
> --
> Juju-dev mailing list
> Juju-dev@lists.ubuntu.com
> Modify settings or unsubscribe at:
> https://lists.ubuntu.com/mailman/listinfo/juju-dev
>
>