Re: Master images are a mess

Paul Larson Wed, 07 Dec 2011 09:47:13 -0800

On Wed, Dec 7, 2011 at 10:01 AM, Zygmunt Krynicki <
zygmunt.kryni...@linaro.org> wrote:


> Hi, sorry for the topic, I wanted to catch your attention.
>
> This is a quick brain dump based on my own observations/battle with
> master images last week.
>
> 1) Unless we use external USB/ETH adapters then cloning a master image
> clones the mac address as well. This has serious consequences and I'm
>
This doesn't ring true.  We do have different MAC addresses, even on boards
without flash and on-board Ethernet.


> 100% sure that's why lava-test had to be switched to the random UUID
>
Incorrect - I hunted down that problem.  We can switch back if you really
like, but I don't see any advantage to it.


> mode. This problem applies to the master image mode. In the test image
> the software can do anything so we may run with a random MAC or with
> the mac that master images' boot loader set (we should check that).
> Since making master images is a mess, unless it becomes automated I
> will not be convinced that people just know how to make them properly
> and are not simply copying from someone. There is no reproducible
> master image creation process that ensures two people with the same
> board can run a single test in a reproducible way! (different starting
> rootfs/bootloader/package selection/random mistakes)
>
It's a pretty big exaggeration to say that it can't be done by others, or
that it affects reproducibility of tests.  The process isn't *that* hard.
It's essentially just a nano image with a couple of extra packages
installed and a few partitions added.  However, I do agree with the
sentiment that this should be automated as much as possible.

> 2) Running code via serial on the master image is a mess. It is very
> fragile. We need an agent on the board instead of a random master
> image+serial shell. The agent will expose board identity, capabilities
> and standard APIs to LAVA (notably the dispatcher).
> The same API, if done sensibly, will work for software emulators and
> hardware boards. Agent API for a software emulator can do different
> things. Dispatcher should be based on agent API instead of ramming the
> serial line.
>
This sounds like a good Connect topic.  It has some advantages, but also a
lot of things to address.


> 3) The master image, as we know it today, should be booting remotely.
> The boot loader can stay on the board until we can push it over USB.
>
The problem is getting it to a state where we can push it over USB for every
board.  Not all boards support this, and the ones that do sometimes have
issues with the tools that make it possible.  We've talked about other
solutions, like an SD interface we can write from an external host over USB
and then boot the board.  One potential pitfall here is that we could
no longer offload the LMC process with celery.  It would HAVE to be
done from the attached host.  That means we are back to serializing LMC
processes, or we need a host for every single dev board!

> The only thing that absolutely has to stay in the card is the lava
> board identity file which would be generated from the web UI. There is
>
If that's needed, then why couldn't it be written when we deploy to the
board?


> no reason to keep rootfs/kernel/initrd there. This means that a single
> small card can fit all tests as well. It also means we can reset the
> master image (as currently it is writeable by the board and can be
> corrupted) before booting to ensure consistent behaviour. I did some
> work on that and I managed to boot panda over NFS. Ideally I want to
> boot over NBD (network block device) which is much faster and with a proper
> "master image" init script we can expose a single read-only network block
> device to _all_ the boards.
>
> 4) With an agent on each board and an identity file on the SD card, LAVA will
> know if cloning happened. We could do dynamic board detection (unplug
> the board -> it goes away, plug it back -> it shows up). We could move
> a board from system to system and have 0config transitions.
>
Ok, you lost me here.  In the last point you made, you seemed to be
advocating for erasing the entire SD with the image we want to deploy.


> 5) Dispatcher should drop all configuration files. Sure it made sense
> 12 months ago when the idea was to run it standalone. Now all of that
> configuration should be in the database and should be provided by the
> scheduler to the dispatcher as a big serialized argument (or a file
> descriptor or a temporary file on disk). Setting up the dispatcher for
> a new instance is a pain and unless you can copy stuff from the
> validation server and ask everyone around for help it's very hard to
> get right. If master images could be constructed programmatically and
> with an agent on each "master image" LAVA would just get that
> configuration for free.
>
We should talk to our users about this though.  We already *know* we have
users that are using the dispatcher standalone today.  I think it's
possible to still move this config into the database, but we don't want to
pull the rug out from under anyone without a good plan of how to get them
standing again.


> 6) We should drop conmux. As in the lab we already have TCP/IP sockets
> for the serial lines we could just provide my example serial->tcp
> script as lava-serial service that people with directly attached
> boards would use. We could get a similar lava-power service if that
> would make sense. The lava-serial service could be started as an
> instance for all USB/SERIAL adapters plugged in if we really wanted
> (hello upstart!). The lava-power service would be custom and would
> require some config but it is very rare. Only the lab and I have
> something like that. Again it should be instance based IMHO so I can
> say: 'start lava-power CONF=/etc/lava-power/magic-hack.conf' and see
> LAVA know about a power service. One could then say that a particular
> board uses a particular serial and power services.
>
Another good topic for Connect, I think.  It needs a lot of fleshing
out, but it sounds like you are basically suggesting we should recreate the
functionality of conmux in lava.  Conmux does two primary things for us:
1. It gives us a single interface for dealing with a variety of boards
2. It provides a safer, more convenient way of dealing with console and
power on boards.  To console in to a board, and hardreset the power in
conmux:
 $ conmux-console panda01
 $ ~$hardreset

Without it?
 [lookup wiki or database to find the console server/port]
 $ telnet console01 7001
 [notice it's hung]
 ^] quit
 [lookup wiki or db to find the pdu server/port]
 $ telnet pdu01
 [go through long menu driven interface to tell it to reset the port]
 ^] quit
 $ telnet console01 7001
 [still hung... notice that you accidentally reset some other board that
someone was running a test on]

Sure, we could provide a command line tool for looking up those things in
the lava database, and give admins an easy interface to just say "take me
to the console of this machine", or "hardreset this machine".  If we did
that, and also added attached serial multiplexing, we would have...
rewritten conmux. :)
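
To make the lookup idea concrete, here is a rough sketch of what such a
tool might do internally.  Everything in it is hypothetical: the BOARDS
dict stands in for the lava database, the PDU outlet number is made up,
and the function names are only for illustration.  It is not how conmux
or lava actually store this information.

```python
from typing import NamedTuple


class BoardInfo(NamedTuple):
    """Where a board's serial console and power outlet live (hypothetical schema)."""
    console_host: str
    console_port: int
    pdu_host: str
    pdu_outlet: int


# Stand-in for the lava database; the outlet number is an invented example value.
BOARDS = {
    "panda01": BoardInfo("console01", 7001, "pdu01", 4),
}


def console_command(board):
    """Return the telnet command line that reaches the board's console."""
    info = BOARDS[board]  # raises KeyError for an unknown board
    return ["telnet", info.console_host, str(info.console_port)]


def hardreset_target(board):
    """Return (pdu_host, outlet) so only this board's outlet gets reset."""
    info = BOARDS[board]
    return (info.pdu_host, info.pdu_outlet)
```

Keying both lookups on the board name is the point: an admin never
types a console port or PDU outlet by hand, which is what avoids
resetting somebody else's board.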

That being said, I think it could still be useful because it can give us
an API to call from within lava, rather than having to call out to a
command line tool.  However, I don't see a huge urgency for it at the
moment.

Thanks,
Paul Larson

_______________________________________________
linaro-dev mailing list
linaro-dev@lists.linaro.org
http://lists.linaro.org/mailman/listinfo/linaro-dev
