Re: [OpenWrt-Devel] Notes on (dangerous ?) sysupgrade

2019-01-13 Thread Reiner Karlsberg

On 13.01.2019 at 14:31, Jo-Philipp Wich wrote:

Hi Reiner,


After having several unpleasant encounters using sysupgrade, I had a
quick glance at the code, after more or less successfully implementing
workarounds for incomplete sysupgrades, resulting in inconsistent systems.
My questions are:
- Is it safe, simply to kill running processes during sysupgrade ? As
there might be services, restarted automatically (by procd ?).


Roughly, the sysupgrade process is as follows:

1) /sbin/sysupgrade (shell script)

Parses arguments, sets defaults, assembles conffiles to back up, runs
partial scripts in /lib/upgrade, checks the image, ends with `ubus call
system sysupgrade`. All fatal exit conditions (such as invalid image)
should be handled here.

2) ubus call system sysupgrade (procd ubus procedure)

Invokes a procedure in procd that instructs procd to terminate itself
and exec into /sbin/upgraded (which has been copied to a ramdisk at
/tmp/root first), turning /tmp/root/sbin/upgraded into pid 1 and
releasing the pid 1 use of /.

3) /tmp/root/sbin/upgraded (binary)

Functions as a pid 1 placeholder to prevent the kernel from panicking. It
does two things: it keeps serving the watchdog to prevent spontaneous
resets, and it executes /lib/upgrade/stage2.

4) /lib/upgrade/stage2 (shell script)

Assemble backup tarball, write image, append backup tarball to just
written image. The exact procedure depends on the platform.


So yes, it is safe to simply kill processes in the sense that there will
be no procd running anymore at this point which would relaunch them.

Merely killing processes instead of shutting them down through their
respective init scripts is not ideal, though; that eventually needs rework.

Ideally sysupgrade should try to cleanly stop as many services through
their respective init scripts as possible before invoking stage2, then
only do the 'kill TERM; sleep 3; kill KILL' sequence on processes that
somehow failed to stop initially (buggy init scripts, timeouts, ...).


-  What about a killed process, simply taking some time to shut down ?
(example: squid closing lot of open files on block-device; having
internal shutdown timer 30s by default)


Such services are not gracefully handled atm, see above.


- What about open swap file on block-device ?


From a cursory look, it does not appear that sysupgrade currently
performs any swapoff at all. Adding a `swapoff -a` after the process
termination would certainly make sense.


- What about mounted block-device for mass storage ?


Same as swap, there is no umount handling either as far as I can see. I
think this should be added as well along with the swapoff. Since the
sysupgrade runs off a pivot_root'ed /tmp/root at this point, all
filesystems should be free to unmount. (It might still need two or three
passes due to layered mounts.)


- What about (slow) wwan connection, managed by pppd. When killed by
sysupgrade, will netifd restart pppd ?


It should not happen. Theoretically, pppd could be killed first while
netifd is still running; netifd would then try to restart pppd shortly
before netifd itself gets killed, but the second KILL loop three seconds
later should catch this rare circumstance.

However, as discussed above a graceful service shutdown would be better.


As a workaround, before calling sysupgrade I
- explicitly use /etc/init.d/most_services stop
- explicitly kill squid and wait for termination
- explicitly disable swap
- explicitly dismount mounted block-device
- ifdown wwan


That certainly makes a lot of sense and most of this should probably go
into sysupgrade (stage1 aka /sbin/sysupgrade) directly. A slight
difficulty I see is how to identify "most_services" but I guess a
hardcoded whitelist for things like "dropbear", "openssh" or "telnetd"
will do.

As for awaiting squid termination - I think if it's not already the case,
the squid init script should be reworked so that /etc/init.d/squid stop
does not return (successfully) before squid is actually stopped.


Before I had several cases, that
sysupgrade -n -v -f /tmp/newfiles.tar.gz /tmp/new_fw.bin
updated all files from /tmp/newfiles.tar.gz, but did not do the flash of
new_fw.bin


This is quite strange as appending the /tmp/newfiles.tar.gz archive will
only happen after /tmp/new_fw.bin has been written. I could only imagine
that the image write procedure itself somehow failed, but appending the
archive still worked.

How exactly this could fail depends on the platform. Can you provide
some more details about the device this issue occurred on?

~ Jo




I had these observations on my ZBT WE1026-5g.
I am running several special services, like squid, collectd, chilli, nginx, 
uhttpd, openvpn.
The WE1026-5g includes a SD-card, used for swap-file, squid-caching, logfiles 
(from squid and nginx).
Quectel EC25 is used for wwan (serial, 3g); but I _think_ I had same effects 
using wan instead of wwan, too.
Additionally, I have several simple private processes, like continuous ping to 
keep wwan 

Re: [OpenWrt-Devel] Notes on (dangerous ?) sysupgrade

2019-01-13 Thread Jo-Philipp Wich
Hi Reiner,

> After having several unpleasant encounters using sysupgrade, I had a
> quick glance at the code, after more or less successfully implementing
> workarounds for incomplete sysupgrades, resulting in inconsistent systems.
> My questions are:
> - Is it safe, simply to kill running processes during sysupgrade ? As
> there might be services, restarted automatically (by procd ?).

Roughly, the sysupgrade process is as follows:

1) /sbin/sysupgrade (shell script)

Parses arguments, sets defaults, assembles conffiles to back up, runs
partial scripts in /lib/upgrade, checks the image, ends with `ubus call
system sysupgrade`. All fatal exit conditions (such as invalid image)
should be handled here.

2) ubus call system sysupgrade (procd ubus procedure)

Invokes a procedure in procd that instructs procd to terminate itself
and exec into /sbin/upgraded (which has been copied to a ramdisk at
/tmp/root first), turning /tmp/root/sbin/upgraded into pid 1 and
releasing the pid 1 use of /.

3) /tmp/root/sbin/upgraded (binary)

Functions as a pid 1 placeholder to prevent the kernel from panicking. It
does two things: it keeps serving the watchdog to prevent spontaneous
resets, and it executes /lib/upgrade/stage2.

4) /lib/upgrade/stage2 (shell script)

Assemble backup tarball, write image, append backup tarball to just
written image. The exact procedure depends on the platform.


So yes, it is safe to simply kill processes in the sense that there will
be no procd running anymore at this point which would relaunch them.

Merely killing processes instead of shutting them down through their
respective init scripts is not ideal, though; that eventually needs rework.

Ideally sysupgrade should try to cleanly stop as many services through
their respective init scripts as possible before invoking stage2, then
only do the 'kill TERM; sleep 3; kill KILL' sequence on processes that
somehow failed to stop initially (buggy init scripts, timeouts, ...).
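The ordering described above (init scripts first, signals only as a fallback) could look roughly like the following. This is a minimal sketch and not taken from the sysupgrade sources: the function names, the INITD_DIR and KEEP_SERVICES variables and the keep-list itself are assumptions for illustration. Real init script names may differ from the package names used here, and anything the upgrade session depends on (the network service, for instance) would have to be added to the keep-list.

```shell
#!/bin/sh
# Hypothetical helpers, not part of sysupgrade.

INITD_DIR="${INITD_DIR:-/etc/init.d}"
# Assumed keep-list: services that must survive so the session stays up.
# Extend as needed (e.g. with "network" for remote upgrades).
KEEP_SERVICES="${KEEP_SERVICES:-dropbear openssh telnetd}"

# Return 0 if service name $1 is on the keep-list.
is_kept() {
    for kept in $KEEP_SERVICES; do
        [ "$kept" = "$1" ] && return 0
    done
    return 1
}

# Print the names of all executable init scripts not on the keep-list.
services_to_stop() {
    for script in "$INITD_DIR"/*; do
        [ -x "$script" ] || continue
        name="${script##*/}"
        is_kept "$name" || echo "$name"
    done
}

# Ask every remaining service to stop through its init script.
stop_services() {
    for name in $(services_to_stop); do
        "$INITD_DIR/$name" stop ||
            echo "warning: $name did not stop cleanly" >&2
    done
}

# Fallback for stragglers: TERM, wait $1 seconds, then KILL the given PIDs.
term_then_kill() {
    grace="$1"
    shift
    kill -TERM "$@" 2>/dev/null || true
    sleep "$grace"
    kill -KILL "$@" 2>/dev/null || true
}
```

A caller would run `stop_services` first and pass only the PIDs that are still alive afterwards to `term_then_kill 3 ...`, which mirrors the 'kill TERM; sleep 3; kill KILL' sequence.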

> -  What about a killed process, simply taking some time to shut down ?
> (example: squid closing lot of open files on block-device; having
> internal shutdown timer 30s by default)

Such services are not gracefully handled atm, see above.

> - What about open swap file on block-device ?

From a cursory look, it does not appear that sysupgrade currently
performs any swapoff at all. Adding a `swapoff -a` after the process
termination would certainly make sense.

> - What about mounted block-device for mass storage ?

Same as swap, there is no umount handling either as far as I can see. I
think this should be added as well along with the swapoff. Since the
sysupgrade runs off a pivot_root'ed /tmp/root at this point, all
filesystems should be free to unmount. (It might still need two or three
passes due to layered mounts.)
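As a rough illustration of the swapoff plus repeated umount idea, here is a sketch written as a pre-upgrade workaround rather than as a patch to stage2. It only touches block-device mounts at or below a given prefix, so the root filesystem and overlay are left alone. The function names, the MOUNT_TABLE variable and the limit of three passes are assumptions; mount points containing spaces are not handled.

```shell
#!/bin/sh
# Hypothetical helpers, not part of sysupgrade.

MOUNT_TABLE="${MOUNT_TABLE:-/proc/mounts}"

# Print mount points at or below prefix $1 that are backed by a /dev node,
# longest (deepest) path first so nested mounts are released before parents.
mounts_under() {
    awk -v p="$1" '
        $1 ~ /^\/dev\// && ($2 == p || index($2, p "/") == 1) {
            print length($2), $2
        }' "$MOUNT_TABLE" | sort -rn | cut -d' ' -f2-
}

# Disable swap, then try up to three passes of umount below prefix $1.
# Returns 0 once nothing is left mounted there, 1 otherwise.
release_storage() {
    prefix="$1"
    swapoff -a || echo "warning: swapoff failed" >&2
    pass=1
    while [ "$pass" -le 3 ]; do
        for mp in $(mounts_under "$prefix"); do
            umount "$mp" 2>/dev/null || true
        done
        [ -z "$(mounts_under "$prefix")" ] && return 0
        pass=$((pass + 1))
    done
    echo "still mounted: $(mounts_under "$prefix" | tr '\n' ' ')" >&2
    return 1
}
```

Calling `release_storage /mnt` before sysupgrade would cover an SD card mounted under /mnt; a non-zero return value is a reason to abort rather than flash with storage still in use.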

> - What about (slow) wwan connection, managed by pppd. When killed by
> sysupgrade, will netifd restart pppd ?

It should not happen. Theoretically, pppd could be killed first while
netifd is still running; netifd would then try to restart pppd shortly
before netifd itself gets killed, but the second KILL loop three seconds
later should catch this rare circumstance.

However, as discussed above a graceful service shutdown would be better.

> As a workaround, before calling sysupgrade I
> - explicitly use /etc/init.d/most_services stop
> - explicitly kill squid and wait for termination
> - explicitly disable swap
> - explicitly dismount mounted block-device
> - ifdown wwan

That certainly makes a lot of sense and most of this should probably go
into sysupgrade (stage1 aka /sbin/sysupgrade) directly. A slight
difficulty I see is how to identify "most_services" but I guess a
hardcoded whitelist for things like "dropbear", "openssh" or "telnetd"
will do.

As for awaiting squid termination - I think if it's not already the case,
the squid init script should be reworked so that /etc/init.d/squid stop
does not return (successfully) before squid is actually stopped.
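A blocking stop of that kind boils down to polling the process until it is gone or a timeout expires. The helper below is a minimal sketch of that idea only; its name and the default timeout of 30 seconds (matching the squid shutdown timer mentioned in the question) are assumptions, and it is not how the squid init script currently works.

```shell
#!/bin/sh
# Hypothetical helper: wait until the process with PID $1 has exited,
# giving up after $2 seconds (default 30). Returns 0 if the process is
# gone, 1 if it is still alive when the timeout is reached.
wait_for_exit() {
    pid="$1"
    timeout="${2:-30}"
    waited=0
    while kill -0 "$pid" 2>/dev/null; do
        [ "$waited" -ge "$timeout" ] && return 1
        sleep 1
        waited=$((waited + 1))
    done
    return 0
}
```

A stop routine could request the shutdown, call `wait_for_exit` with the daemon's PID, and only report success if it returns 0. Where that PID comes from (a PID file, for instance) depends on the service and is left open here.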

> Before I had several cases, that
> sysupgrade -n -v -f /tmp/newfiles.tar.gz /tmp/new_fw.bin
> updated all files from /tmp/newfiles.tar.gz, but did not do the flash of
> new_fw.bin

This is quite strange as appending the /tmp/newfiles.tar.gz archive will
only happen after /tmp/new_fw.bin has been written. I could only imagine
that the image write procedure itself somehow failed, but appending the
archive still worked.

How exactly this could fail depends on the platform. Can you provide
some more details about the device this issue occurred on?

~ Jo



___
openwrt-devel mailing list
openwrt-devel@lists.openwrt.org
https://lists.openwrt.org/mailman/listinfo/openwrt-devel


Re: [OpenWrt-Devel] Notes on (dangerous ?) sysupgrade

2019-01-13 Thread Sebastian Moeller
Mmmh, I have a hunch a recent observation of mine might be related 
(unfortunately I have no log data):

Within a set of recent master builds, sysupgrade from a system that had been
running for a few days initially behaved as expected: it disconnected the
current ssh connection, and the LED pattern on the router looked as if the
device was updating itself. After the automatic reboot, however, the router
came back up still running the previous firmware. Redoing the sysupgrade on
that freshly booted system has so far always worked as expected. I think I
only started to see this with builds from the end of last year on. So far I
had brushed this off as a sign that my trusty old wndr3700v2 might be
reaching the end of its life, but with this report I am not so sure and will
try to log things better the next time this happens.
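One cheap way to notice this kind of silent failure is to compare the firmware revision after the reboot with the one that was supposed to be flashed. The sketch below assumes the running revision can be read from DISTRIB_REVISION in /etc/openwrt_release; the function names and the RELEASE_FILE variable are made up for illustration.

```shell
#!/bin/sh
# Hypothetical helpers, not part of sysupgrade.

RELEASE_FILE="${RELEASE_FILE:-/etc/openwrt_release}"

# Print the revision string of the currently running firmware.
running_revision() {
    ( . "$RELEASE_FILE" && printf '%s\n' "$DISTRIB_REVISION" )
}

# Return 0 if the running revision equals the expected one ($1),
# 1 if it differs, 2 if the release file could not be read.
upgrade_took_effect() {
    expected="$1"
    actual=$(running_revision) || return 2
    if [ "$actual" = "$expected" ]; then
        return 0
    fi
    echo "expected $expected, still running $actual" >&2
    return 1
}
```

Run from the management host after the device is reachable again, a non-zero result from `upgrade_took_effect` gives a concrete point at which to collect logs before retrying.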


> On Jan 13, 2019, at 11:08, Reiner Karlsberg wrote:
> 
> I am an unhappy user of sysupgrade for remote installed devices.
> (Besides these ones:  
> https://forum.openwrt.org/t/sysupgrade-return-code-in-18-06-vs-17-01/22316/9)
> 
> After having several unpleasant encounters using sysupgrade, I had a quick 
> glance at the code, after more or less successfully implementing workarounds 
> for incomplete sysupgrades, resulting in inconsistent systems.
> My questions are:
> - Is it safe, simply to kill running processes during sysupgrade ? As there 
> might be services, restarted automatically (by procd ?).
> -  What about a killed process, simply taking some time to shut down ? 
> (example: squid closing lot of open files on block-device; having internal 
> shutdown timer 30s by default)
> - What about open swap file on block-device ?
> - What about mounted block-device for mass storage ?
> - What about (slow) wwan connection, managed by pppd. When killed by 
> sysupgrade, will netifd restart pppd ?
> 
> As a workaround, before calling sysupgrade I
> - explicitly use /etc/init.d/most_services stop
> - explicitly kill squid and wait for termination
> - explicitly disable swap
> - explicitly dismount mounted block-device
> - ifdown wwan
> 
> Before I had several cases, that
> sysupgrade -n -v -f /tmp/newfiles.tar.gz /tmp/new_fw.bin
> updated all files from /tmp/newfiles.tar.gz, but did not do the flash of 
> new_fw.bin
> Resulting in inconsistent system.




[OpenWrt-Devel] Notes on (dangerous ?) sysupgrade

2019-01-13 Thread Reiner Karlsberg

I am an unhappy user of sysupgrade for remote installed devices.
(Besides these ones:  
https://forum.openwrt.org/t/sysupgrade-return-code-in-18-06-vs-17-01/22316/9)

After having several unpleasant encounters using sysupgrade, I had a quick glance at the code, after more or less 
successfully implementing workarounds for incomplete sysupgrades, resulting in inconsistent systems.

My questions are:
- Is it safe, simply to kill running processes during sysupgrade ? As there might be services, restarted automatically 
(by procd ?).
-  What about a killed process, simply taking some time to shut down ? (example: squid closing lot of open files on 
block-device; having internal shutdown timer 30s by default)

- What about open swap file on block-device ?
- What about mounted block-device for mass storage ?
- What about (slow) wwan connection, managed by pppd. When killed by 
sysupgrade, will netifd restart pppd ?

As a workaround, before calling sysupgrade I
- explicitly use /etc/init.d/most_services stop
- explicitly kill squid and wait for termination
- explicitly disable swap
- explicitly dismount mounted block-device
- ifdown wwan

Before I had several cases, that
sysupgrade -n -v -f /tmp/newfiles.tar.gz /tmp/new_fw.bin
updated all files from /tmp/newfiles.tar.gz, but did not do the flash of 
new_fw.bin
Resulting in inconsistent system.
