Re: SWUpdate+EBG: The impossible state and how it's being handled so far

Stefano Babic Wed, 22 Feb 2023 00:36:44 -0800

Hi Christian, Jan,

On 21.02.23 22:21, Christian Storm wrote:

Hi,

playing with updates, I maneuvered the EBG envs on a system into this
weird state:


----------------------------
    Config Partition #0 Values:
in_progress:      yes
revision:         4
kernel:           C:BOOT1:linux.efi
kernelargs:
watchdog timeout: 0 seconds
ustate:           3 (FAILED)

user variables:
recovery_status = failed


Hm, did you start with a clean environment and SWUpdate >= 2022.12?


I think we can reach the status with any SWUpdate version.

----------------------------
    Config Partition #1 Values:
in_progress:      no
revision:         3
kernel:           C:BOOT1:linux.efi
kernelargs:
watchdog timeout: 0 seconds
ustate:           2 (TESTING)

user variables:


I see - we should *never* reach this state.


To get there, I started an update with swupdate and booted into testing
path #1.

Ok

But then didn't confirm this update and rather started it
again, using the same swu.


It looks to me that this is the point. SWUpdate requires the transaction
to be closed, either for itself or for the deployment server (Hawkbit).
If a system boots with TESTING, the glue logic should start SWUpdate and
ask it to close the transaction, with OK or FAILED, by passing the -c
parameter.

However, this was thought to work together with the deployment server,
because it handles the state machine on Hawkbit. The parameter is
ignored if another deployment interface (Webserver, USB, ..) is used.


The suricatta modules handle this for you ― as a "convenience" feature
and to keep the (hawkBit, ...)  server's view of things consistent with
the device's, which is more important than the convenience aspect :)

Right, this was an initial decision. For use cases without suricatta (aka Hawkbit), this is handled outside SWUpdate, often before running SWUpdate. It is the duty of the integrator to understand this and add the required glue logic.

I am just asking whether this should be handled completely by SWUpdate, if configured. The "state" itself is part of SWUpdate's state machine, too, and it could be moved into core, informing suricatta to send the correct feedback to the deployment server.


If you're running it with other modules/modes, you're on your own.


Right.

Then, you have to play along the (convention) rules to close the
transaction as there's nothing preventing you to get into this
situation with EFI Boot Guard.

Exactly - the issue arises because the transaction was not closed, and the glue logic is missing in Jan's use case.


Hence, the valid question whether this should be allowed / denied by EFI
Boot Guard or the tools (SWUpdate in this case) making use of it?

IMHO EFI Boot Guard should be transparent, and someone else takes the decision. My question here is whether we add a way to avoid external glue logic and put it into SWUpdate's core.

In such situations this is (again) managed in glue logic, and the
transaction (that is, the setting of ustate) is closed before starting
SWUpdate. Or, in the case of U-Boot, it is also managed with the help of
additional (and custom) variables.

In your case, it seems that nothing is done at boot time, and SWUpdate
is started. SWUpdate does not know (because it expects that someone has
already decided, and ustate is not checked) that a new software is
running, and the same SWU is loaded again.
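
To make the missing step concrete, here is a minimal Python sketch of the
kind of boot-time glue logic meant above. It is only an illustration under
assumptions: the Ustate values 2 (TESTING) and 3 (FAILED) are the ones
printed by bg_printenv in this thread, 0 (OK) and 1 (INSTALLED) are assumed,
and boot_time_action() with its return strings is a hypothetical name, not
part of SWUpdate or EFI Boot Guard.

```python
from enum import IntEnum


class Ustate(IntEnum):
    OK = 0         # assumed value
    INSTALLED = 1  # assumed value
    TESTING = 2    # as printed by bg_printenv in this thread
    FAILED = 3     # as printed by bg_printenv in this thread


def boot_time_action(ustate: Ustate, system_healthy: bool) -> str:
    """Hypothetical glue logic: decide what to do before a new SWU is accepted."""
    if ustate == Ustate.TESTING:
        # A new software is running for the first time, so the open
        # transaction has to be closed before anything else happens.
        return "confirm-ok" if system_healthy else "confirm-failed"
    # No open transaction: SWUpdate can be started as usual.
    return "start-swupdate"
```

With this, boot_time_action(Ustate.TESTING, True) yields "confirm-ok", and
a second install is only considered once the TESTING state has been left.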


Exactly, here you're on your own.


Yes, you are on your own !!

 You have to instrument EFI Boot Guard
so that it's happy... which is convention and not enforced, currently.
Granted, this requires a lot of context knowledge how to integrate
things properly and seamlessly...

That didn't complete because the UUID clash
was detected. swupdate terminated, and I was left with the above.

I can still boot this constellation, EBG will select path #1 (endless
testing, so to say). OTOH:

# bg_printenv -c
Using latest config partition
Values:
in_progress:      yes
revision:         4
kernel:           C:BOOT1:linux.efi
kernelargs:
watchdog timeout: 0 seconds
ustate:           3 (FAILED)

user variables:
recovery_status = failed


That is not quite correct. To be fair, bg_printenv deals with an illegal
state here.


Agree.

Still...

The key question is where to avoid best entering this state in the first
place?


My question is why the transaction was not closed before running
SWUpdate. This is a common pattern even with other bootloaders, but it
is more important here because EBG stores a history (well, with depth=1)
of the previous run.

SWUpdate can check the state when it is running, but there is no general
case. There are use cases where the OK comes from the application, and
SWUpdate waits for the result via IPC (but then SWUpdate is started with
the WAIT option, and does not try to load a new SWU). So SWUpdate cannot
decide by itself that TESTING is a wrong ustate, because that depends on
the individual project.


One common pattern is to have a "health" target and once that's reached
you start SWUpdate with according parameters (or set them yourself via
some glueing method). But again, that is convention, not enforced, and
it's currently the responsibility of the system integrator to get right.


Exactly.

I was running swupdate manually from the command line. No backend
involved, just the desire to intentionally break things. ;)


The best way to reach the goal...:-D


If you would have used suricatta, you would have missed this :)

And yes, this can happen because the part deciding whether the previous
update was ok is missing. In most projects, if the system is up and running,
it is considered ok. That means the decision is made in SWUpdate's systemd
unit (or SystemV init script), see also the glue logic under
/usr/lib/swupdate. In some other cases, the update is ok only if the
application is running, a migration of a custom database was ok, and so
on... that means the decision is made outside SWUpdate. SWUpdate supports
all these use cases.


Yes, that's the codified context knowledge. Still, if you miss out on
one thing, the whole integration will crash and burn. And it's quite
easy to miss a thing...

There is a balance between the flexibility to cover all use cases and convenience.


The question is whether there is a generic pattern like the "health"
target I sketched above so that SWUpdate can handle and abstract
the bootloader interactions?


Yes - and yes, the generic pattern is:

- transaction is closed by SWUpdate and not by another process

- SWUpdate evaluates ustate and closes the transaction, independently of whether the update was done via suricatta, Webserver, USB, or command line

- related processes like suricatta are informed and do what they need to do: suricatta sends feedback to Hawkbit.

Nevertheless, the "open" approach must still remain in case a custom acknowledge is required. This is also very common, for example if an operator must acknowledge the update via a GUI.
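
As a rough illustration of the pattern above, here is a small Python sketch.
It is not SWUpdate code: close_transaction(), the auto_confirm switch and the
listener callbacks are hypothetical names used only to show the idea, and
the numeric value for OK is assumed (TESTING = 2 and FAILED = 3 are taken
from the bg_printenv output in this thread).

```python
from typing import Callable, List

OK = 0       # assumed numeric value
TESTING = 2  # as printed by bg_printenv in this thread
FAILED = 3   # as printed by bg_printenv in this thread


def close_transaction(ustate: int, auto_confirm: bool, healthy: bool,
                      listeners: List[Callable[[int], None]]) -> int:
    """Hypothetical core step: close an open transaction, whatever interface
    (suricatta, Webserver, USB, command line) delivered the update."""
    if ustate != TESTING:
        return ustate  # no open transaction, nothing to close
    if not auto_confirm:
        return ustate  # "open" approach: wait for a custom acknowledge
    new_ustate = OK if healthy else FAILED
    for notify in listeners:
        # Related processes are informed, e.g. suricatta could then send
        # its feedback to the deployment server.
        notify(new_ustate)
    return new_ustate
```

With auto_confirm disabled, the state stays in TESTING until somebody else
(an application, an operator via a GUI) closes the transaction, so the
existing use cases would not break.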


Then, any SWUpdate mode/module will behave the same and there's all
in one place reducing the need for having all the context knowledge...


Correct.

To avoid the issue you are seeing, the decision should be made inside SWUpdate:
something like a transition TESTING ==> OK, because SWUpdate is running. But
as I said, this can only be done if it is configurable, or it will break the
use cases I mentioned.


This is essentially promoting the current suricatta behavior to all
SWUpdate modes/modules w/o the remote reporting part if not run from
a suricatta module. Would be a starter...


Exactly, this is what should be done.

Regards,
Stefano



Kind regards,
    Christian


--
=====================================================================
DENX Software Engineering GmbH,        Managing Director: Erika Unter
HRB 165235 Munich,   Office: Kirchenstr.5, 82194 Groebenzell, Germany
Phone: +49-8142-66989-53 Fax: +49-8142-66989-80 Email: [email protected]
=====================================================================
