Re: [OE-core] Findings from using Hash Equivalence server

Joshua Watt Thu, 21 Sep 2023 07:59:30 -0700

On Thu, Sep 21, 2023 at 6:38 AM Richard Purdie
<richard.pur...@linuxfoundation.org> wrote:
>
> On Thu, 2023-09-21 at 12:13 +0200, Tobias Hagelborn wrote:
> > Findings from using Hash Equivalence server at scale:
> > =====================================================
> >
> > We have now used the OE Hash Equivalence server at scale in the C/I chain
> > at our company. This has given us some insight into what works and what
> > can be improved when running this service in full production.
> >
> > Some stats from our run:
> > ------------------------
> >
> > ### Hash Equivalence server:
> > * ~30M reqs/day
> > * ~500K new entries/day
> > * 2-3K reqs/s
> > * Data growth ~10GiB/day (with dbg info)
> >
> > ### C/I Builds:
> > * ~20K builds/day
> > * 15-20K tasks/build
> > * 140K sstate misses/day.
>
> These are really interesting numbers, thanks for sharing!


Very cool!

>
> > The good:
> > ---------
> > * It works! It finds reusable sstate tasks (in all of our builds)
> > * It is very valuable in recipes that invalidate
> >   almost any other package. (Examples: glib, openssl)
> > * Hashserve scales to the need even if only running one instance
> >   - Tested for 12K req/s which was sufficient for our needs
> > * Robust enough (if no external cleanup takes place)
> >   - _No_ non-recoverable crashes during 300M requests served
> >
> > The client side (Bitbake):
> > --------------------------
> > We have added a sanity check that disables the use of a remote HE server
> > and switches to a local one if the remote HE server cannot be reached.
> > This is done from the "ConfigParsed" event. The reason for this is to
> > avoid builds hanging in case the remote HE server is not responding.
>
> I'd have to see what that patch looked like but I get nervous about
> builds changing from the user's configuration without the user making
> that change. If it shows a suitable warning it might be ok.
>
> ConfigParsed is a horrible place to do such a thing too. I know why
> you've done that but it would probably be better in the hashserve
> client code itself from an upstream perspective if you plan to submit
> it.
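
For illustration only, a minimal sketch of the kind of reachability check
described above, written as a plain helper rather than a "ConfigParsed"
handler. The function name and the timeout are hypothetical and this is not
the patch being discussed; it assumes the server address is a "host:port"
string and that "auto" is what BB_HASHSERVE takes to start a local server.
It warns before falling back so the change is visible to the user.

```python
import socket
import warnings


def pick_hashserv(remote, timeout=2.0):
    """Return `remote` ("host:port") if it accepts a TCP connection within
    `timeout` seconds; otherwise warn and return "auto" (local server)."""
    host, _, port = remote.rpartition(":")
    try:
        # Only checks that something is listening; it does not speak the
        # hashserv protocol, so a wedged server could still pass.
        with socket.create_connection((host, int(port)), timeout=timeout):
            return remote
    except (OSError, ValueError) as exc:
        warnings.warn(
            f"hash equivalence server {remote} unreachable ({exc}); "
            "falling back to a local server"
        )
        return "auto"
```

A connect-only probe is cheap, but it cannot tell a healthy server from one
that accepts connections and then stalls, so a real check would likely need
a protocol-level ping with its own timeout.
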
>
> > Areas of improvement:
> > ---------------------
> >
> > ### Data retention:
> > There is no built-in data retention. We solve this with a recurring
> > external cleanup script. It works, but it also exposed locking in the
> > server, resulting in the server and the external cleanup locking each
> > other out and in timeouts for the clients. This may partly be due to the
> > nature of SQLite and a single-file database. OE Hashserve is not built
> > for cleaning up data, and the data growth is high, so it has to be
> > handled.
> >
> > ### Protocol:
> > As our first tested option for deployment was Kubernetes, the absence of
> > the de facto standard HTTP(S) protocol required some workarounds.
> > Routing, authentication and monitoring support get lost on the way. I
> > would suggest that we look into using HTTP(S) + JSON and some Basic Auth
> > as the next basis of the protocol.
>
> We did a lot of work to make answers "fast" from the server. To make
> 12k req/s possible, we couldn't use a high-level protocol like HTTP, let
> alone HTTPS or signing. I would therefore caution that if you change it
> like this, it will no longer perform anywhere near as fast as you
> require and there would be knock on effects from that.

I think there are options here, like websockets that would perform at
pretty much the same levels and be compatible with web based
infrastructure, but it would take some effort. The trickier part is
making sure it's optional in the bitbake core, because it's not
something you'd want to implement without external Python modules, and
bitbake core doesn't want to depend on external modules (by default
anyway). Once the websocket is established, the protocol is identical
so most of the bitbake code would be unchanged; it would only affect
how the connection is actually established with the server. I
architected the code with this in mind initially, so I think it's
mostly a matter of the logistics of how to implement it in bitbake.
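
As a rough sketch of that separation (not the actual bitbake code): the
protocol logic only needs an object with send() and recv(), so a TCP or
websocket transport could be swapped in without touching it. All class
names and the JSON message shape below are hypothetical, and the loopback
transport stands in for a real connection.

```python
import json
from typing import Protocol


class Transport(Protocol):
    """Anything that can carry one message each way."""

    def send(self, message: str) -> None: ...
    def recv(self) -> str: ...


class LoopbackTransport:
    """In-process stand-in; a TCP or websocket transport would expose the
    same two methods and differ only in how the connection is set up."""

    def __init__(self, handler):
        self._handler = handler
        self._pending = []

    def send(self, message):
        self._pending.append(self._handler(message))

    def recv(self):
        return self._pending.pop(0)


class Client:
    """Protocol logic, unaware of how the connection was established."""

    def __init__(self, transport: Transport):
        self._transport = transport

    def get_unihash(self, method, taskhash):
        # Hypothetical message shape, for illustration only.
        request = {"get": {"method": method, "taskhash": taskhash}}
        self._transport.send(json.dumps(request))
        return json.loads(self._transport.recv()).get("unihash")
```

With this split, the optional external websocket module would only be
imported by the one transport that needs it, which keeps the default code
path free of extra dependencies.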

On the server side, you'd want to make a stateless frontend HE service
that could talk to a backend database service so you can do all the
high availability and DB replication at scale. It's been on my TODO
list to write a server that works this way for a while (I think it
would be a separate project outside of bitbake), since it's hard to
have a client with no server to talk to :)

>
> > ### Security:
> > Supply-chain attacks are nowadays something to be aware of. This service
> > uses an unencrypted and unauthenticated protocol and is thus open to
> > man-in-the-middle attacks, or any type of fake data and manipulation.
> > The protocol changes suggested above could mitigate some of this.
> > Additionally, one thought is for the client to provide a signature of
> > the hash together with the hash so that it can be verified by the client
> > using its secret upon retrieval.
>
> We've made the assumption that the write access ports would be under
> some kind of network control. For public services like the one the
> project shares, there is a risk of man-in-the-middle attacks but it
> would only really be exploitable if you can change the sstate being
> accessed too and if you can do that, you already have breached
> security.
>
> I suspect adding signing is going to affect speed a lot too
> unfortunately. We probably do need to think about what to do about all
> this but it isn't straightforward.

IIRC switching to a modern web infrastructure solves most of these
problems, since you can do SSL/client certificates/whatever much more
transparently without the server & client really having to know a
whole lot about it.
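
For reference, the per-entry signature idea quoted above could look roughly
like the sketch below, assuming a secret shared by all trusted clients and
HMAC-SHA256 from the Python standard library. The function names and the
"taskhash:unihash" message layout are hypothetical; nothing like this
exists in hashserv as far as this thread says.

```python
import hashlib
import hmac


def sign_unihash(secret: bytes, taskhash: str, unihash: str) -> str:
    """Signature a client would report alongside the unihash."""
    message = f"{taskhash}:{unihash}".encode()
    return hmac.new(secret, message, hashlib.sha256).hexdigest()


def verify_unihash(secret: bytes, taskhash: str, unihash: str,
                   signature: str) -> bool:
    """Check a retrieved entry before trusting it."""
    expected = sign_unihash(secret, taskhash, unihash)
    # compare_digest avoids leaking the match position through timing.
    return hmac.compare_digest(expected, signature)
```

The server would only store and return the signature, so it never needs the
secret, but every client that should trust the entries has to hold the same
key, which is its own distribution problem.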

>
> > ### External database connection
> > An option for an external DB (For example PostgreSQL) would improve the
> > possibility for concurrent cleanup and vacuum while running.
> > In the stateless world of cloud/containers/pods, an option for external DB
> > would be favorable.
> > This would probably make use of some external package for DB interaction,
> > so it would differ a little from the standard Python-only nature of
> > Bitbake in general.
> >
> > ### Nice to have
> > - Minor changes that should be more easily added as patches
> >
> > * Hash (LRU) cache. From our stats, there seems to be a 1:20 ratio of
> >   writes to reads from the DB, so a cache might save some resources.
> >
> > Conclusions
> > ===========
> > Hashserve is FOSS and if we want improvements, we have to contribute.
> > I will investigate to what extent we can chip in on some of these parts.
> > However, especially for the protocol and maybe also for external Python
> > dependencies via PyPI, it would be nice to know whether these are
> > acceptable and wished-for changes before starting out.
>
> I know Joshua has had plans for a different version of hashserve using
> more scalable technology which we can't include directly in bitbake due
> to dependency issues. We had planned to keep the protocol simple and
> minimal so other server implementations were possible.
>
> We haven't really thought about supporting multiple client protocols.
> It would be possible if there are compelling reasons, but it obviously
> would increase our complexity a lot and make it hard to ensure everything
> stays working with suitable testing.
>
> Cheers,
>
> Richard
>
>
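
On the hash (LRU) cache item in the "Nice to have" list quoted above, a
minimal sketch of a read-through cache in front of SQLite, using
functools.lru_cache. The table and column names are hypothetical and not
the real hashserv schema. Misses are deliberately not cached, so an entry
written later becomes visible on the next lookup.

```python
import sqlite3
from functools import lru_cache


def make_cached_lookup(db: sqlite3.Connection, maxsize: int = 100_000):
    """Return lookup(method, taskhash) -> unihash or None, LRU-cached."""

    @lru_cache(maxsize=maxsize)
    def _hit(method, taskhash):
        row = db.execute(
            "SELECT unihash FROM unihashes WHERE method=? AND taskhash=?",
            (method, taskhash),
        ).fetchone()
        if row is None:
            # lru_cache does not store raised exceptions, so misses
            # always go back to the database.
            raise KeyError(taskhash)
        return row[0]

    def lookup(method, taskhash):
        try:
            return _hit(method, taskhash)
        except KeyError:
            return None

    lookup.cache_info = _hit.cache_info  # expose hit/miss counters
    return lookup
```

This only stays correct if an existing mapping never changes once written;
if entries can be updated or cleaned up, the cache would need explicit
invalidation.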

