Re: [Openembedded-architecture] [OE-core] Core workflow: sstate for all, bblock/bbunlock, tools for why is sstate not being reused?

Adrian Freihofer Thu, 23 Nov 2023 12:57:41 -0800

On Wed, 2023-11-22 at 21:14 -0600, Mark Hatle via
lists.openembedded.org wrote:
> 
> 
> On 11/22/23 8:56 AM, [email protected] wrote:
> > > While being able to support ssh (or a related protocol) is
> > > useful,
> > > you need to
> > > also remember that MANY MANY organizations absolutely block SSH
> > > access through
> > > their firewalls.  So _requiring_ sftp would be bad.  Allowing
> > > it's
> > > usage would
> > > be good.
> > > 
> > > As for logging in, https is transport 'security' but not
> > > authentication without
> > > additional helpers.  I think it's absolutely reasonable to say
> > > https
> > > access
> > > either needs an external helper for authentication purposes or
> > > it's
> > > un-authenticated.
> > Yes, there should be an option for authentication, maybe in front
> > of the server, and a helper for doing authorization.
> > 
> > >    If you want (internal to the company) then ssh/sftp or
> > > similar using the ssh-agent (or similar) should be the suggested
> > > approach.
> > > 
> > > We don't want to exclude anyone, but we want to be clear on the
> > > limitations
> > > based on an organization's specific choice.
> > > 
> > You are right. https is what everybody (including us) really wants.
> > 
> > > > If one then wants to scale an sstate-cache server for many
> > > > different
> > > > projects and users, one quickly wishes for an option for
> > > > authorization
> > > > at artifact level. Ideally, the access rights to the source
> > > > code
> > > > would
> > > > be completely transferred to the associated sstate artifacts.
> > > > For
> > > > such
> > > > an authorization the sstate mirror server would require the
> > > > SRC_URI
> > > > which was used to compile the sstate artifact. With this
> > > > information,
> > > > it could ask the Git server whether or not a user has access to
> > > > all
> > > > source code repositories to grant or deny access to a
> > > > particular
> > > > sstate
> > > > artifact. It should not be forgotten that the access rights to
> > > > the
> > > > Git
> > > > repositories can change.
> > > 
> > > In my experience you do not use _one_ sstate-cache for multiple
> > > projects (at an
> > > organization level), each project is responsible for its own
> > > cache.
> > > This
> > > prevents even the possibility that one project could use code not
> > > intended for it.
> > 
> > But the sstate cache is already perfectly suited for deduplicated
> > archiving. The idea of storing all artifacts on a file server and
> > storing additional metadata that enables authorization, storage and
> > backup is therefore obvious when it comes to scaling.
> > 
> > > 
> > >   From a more generic Yocto Project perspective, this means you
> > > really
> > > want to
> > > use a hierarchy of sstate-caches.  (Maybe not a true hierarchy.).
> > > I.e. I use YP,
> > > so I get the YP sstate-cache for the base functionality.  I use
> > > meta-openembedded, so I want the meta-openembedded cache...
> > > project
> > > A, I want
> > > the project's cache, OE and YP caches as well..  project B, I
> > > want
> > > that projects
> > > cache, OE and YP caches.  Project C?  I might want its cache,
> > > Project A,
> > > Project B, and OE and YP.   You can see this gets complicated
> > > quickly.
> > > 
> > > If this either isn't intended or a good idea, then alternatives
> > > need
> > > to be
> > > provided for this.  Everyone always ends up with an upstream
> > > provider
> > > (or
> > > providers) be it YP, OE, OSVs, ISVs, local company resources,
> > > etc.
> > > How do we
> > > manage this and keep it aligned?
> > 
> > I tried to look at it in a hierarchical way. But the longer I
> > looked
> > into it, the less sense it made. The hierarchy is not the most
> > logical
> > thing a sstate cache really has. The most logical thing is the
> > content
> > addressable file names. I see more similarities with e.g. a git
> > tree or
> > a container registry than with a hierarchical system. There are
> > content
> > addressable objects and a database with meta information for
> > organizing
> > these objects.
> 
> I agree.. but where I think hierarchy matters is trust.  I trust YP to be the
> first 
> place to check, I trust my OSV, my ISV, etc..  The reality though is
> git may be 
> a better way to think of it.  It's a collection of hashes from
> multiple sources. 
>   The only priority here is the order in which we search and if you
> include 
> 'untrusted' sources it's at your own risk (of course this is when
> sstate-cache 
> signing becomes important...)


Yes, that's my understanding too. We trust that someone will provide a
Git tree with layers and bitbake whose integrity is intact, and we trust
that someone will provide us with the corresponding (signed) sstate
artifacts.

With binary reproducible builds there is another possibility: several
parties can confirm that the hash of an object matches by performing
the same build. It would therefore be possible to build a system in
which an artifact can have multiple sign-offs. Like git commits.
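
To make the multiple sign-off idea a bit more concrete, here is a
minimal sketch in Python. It is only an illustration under stated
assumptions: the class SignoffLedger, its methods and the artifact name
are hypothetical and not part of bitbake or hashserv, and the builder
names stand in for identities that a real system would verify with
cryptographic signatures.

```python
from collections import defaultdict


class SignoffLedger:
    """Hypothetical helper: records which builder reproduced which hash."""

    def __init__(self, required_signoffs=2):
        self.required_signoffs = required_signoffs
        # artifact name -> output hash -> set of builders reporting that hash
        self._signoffs = defaultdict(lambda: defaultdict(set))

    def sign_off(self, artifact, output_hash, builder):
        """Record that `builder` built `artifact` and got `output_hash`."""
        self._signoffs[artifact][output_hash].add(builder)

    def confirmed_hash(self, artifact):
        """Return the hash confirmed by enough builders, else None.

        Conflicting reports (more than one hash for the same artifact)
        are never treated as confirmed.
        """
        reports = self._signoffs.get(artifact, {})
        if len(reports) != 1:
            return None
        (output_hash, builders), = reports.items()
        if len(builders) >= self.required_signoffs:
            return output_hash
        return None


ledger = SignoffLedger(required_signoffs=2)
ledger.sign_off("busybox:do_package", "abcde", "autobuilder")
ledger.sign_off("busybox:do_package", "abcde", "vendor-ci")
print(ledger.confirmed_hash("busybox:do_package"))  # prints abcde
```

A client could then decide to accept an artifact only once the number
of matching sign-offs reaches a threshold it chooses itself.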

> 
> > > 
> > > Bring in hash equivalency and PR service and things get
> > > complicated.
> > > The
> > > sstate-cache itself is NOT separable from those services.  There
> > > are
> > > ways to
> > > decouple them, but they can be 'extreme'.  I.e. turn off hash-
> > > equivalency, no
> > > need for a hash-equivalency service.   Don't cache the
> > > do_package_write* files,
> > > no PR service....  (but even that isn't fool proof due to git
> > > AUTOINC... so you
> > > end up seeding the AUTOINC with static entries or some other
> > > method...)
> > > 
> > > All of these items need to be dealt with and documented
> > > together.  My
> > > PERSONAL
> > > preference, (without knowing any specific implementation details)
> > > is
> > > that the
> > > contents of hash-equivalency and PR service is somehow stored
> > > with
> > > the sstate-cache.
> > 
> > That's basically the database which is needed to organize the
> > sstate-
> > cache. If bitbake would add more information to this database when
> > uploading the sstate-cache artifacts, the database could be used
> > for
> > more use cases such as authorization or data retention. One example
> > could be the SRC_URI. It would e.g. help with the authorization on
> > artifact level (the user must have access to all repositories
> > referenced by the SRC_URI to get access to the sstate artifact as
> > well) or with the organization of an internal source mirror (each
> > source artifact compiled into an sstate artifact should be on the
> > mirror).
> 
> As an optimization the local system needs a database (PR service and
> hash 
> equivalency).  But the data store and retrieval is better in a
> "filesystem" like 
> storage.  So how do we move from one to the other (sstate-cache is
> already using 
> a filesystem like storage to do many things, so this becomes an
> extension.)
> 
> This allows people to then serve/retrieve sstate-cache via https
> from a web 
> server, a local artifactory, etc etc etc.  Along with the sstate-
> cache they 
> automatically get the matching hash-equivalency and PR service data
> which gets 
> injected into their local database.
> 
> So you get into a situation where you calculate the hash of what you
> need, and 
> retrieve it.  The retrieval could be (the equivalent of) a link that
> has pointer 
> information to the 'real' object, and the link gets injected into the
> local hash 
> equivalency server and then the 'real' object downloaded.. (repeat as
> needed). 
> If the object requested uses 'PR' data, then it should include the
> appropriate 
> data, which can be observed and used to update the local PR server..
> i.e.
> 
> calculate do_write_package is hash abcde, we download abcde, inspect
> it for PR = 
> r0.1, then write PN, PV, PE, PR (hash abcde) into the PR database and
> we've got
> a new matching entry, and possibly an updated "newest AUTOINC"
> version. 
> (Obviously it's not THAT easy, but that's the idea and it keeps it
> all central 
> on the sstate-cache.)  The key is when the sstate-cache is written
> out, do these 
> 'links' and PR service data get included (always) or do we need an
> export step 
> to add them.. (I'd prefer the former, but I've no idea what it does
> to build 
> performance!)
> 
> > > 
> > > One possible way this could be done..  System starts up,
> > > determines
> > > it needs
> > > something it doesn't have, then goes out and checks if an updated
> > > index is
> > > present.
> > >   If it is, downloads it and adds it to its hash equivalency server.
> > > If no
> > > index present, it can then look for the file lets say
> > > "sstate:....link".  If
> > > that comes back, we know we have an equivalency, it's downloaded
> > > added to the
> > > local database and then the pointed to file is retrieved.. 
> > > (.siginfo
> > > and
> > > .tar.xz or whatever).  This would ensure that the index is an
> > > optimization, but
> > > not a requirement and would allow a "live" sstate-cache while
> > > losing
> > > some
> > > performance.  (This doesn't negate any of the comments about
> > > rights
> > > or
> > > possibility to DoS a server via too many connections!)
> > 
> > Not sure I got this idea. But due to the fact that the sstate is
> > content addressable, clients can calculate the index of the sstate-
> > cache locally based on the layer data. If a client does something
> > like:
> > 
> > bitbake core-image-minimal --setscene-only -DD | grep "^DEBUG: SState: Looked for but didn't find file "
> > 
> > bitbake prints all the missing sstate artifacts.
> > 
> > With hash equivalence the missing artifacts have to be mapped to
> > the
> > available artifacts as an additional step which then probably makes
> > a
> > central server mandatory.
> > 
> > > 
> > > Still have to solve the PR service problem, but this could get
> > > 'seeded' via the
> > > associated do_write_package siginfo or similar..  and for the
> > > AUTOINC, seed it
> > > from the siginfo file for a given hash?
> > > 
> > > Doing something like the above could then allow the order
> > > specified
> > > in the
> > > SSTATE_MIRRORS to be used to truly indicate the order things are
> > > resolved and
> > > loaded.
> > > > > Recently we've been wondering about teaching the hashequiv
> > > > > server
> > > > > about
> > > > > "presence", which would then mean the build would only query
> > > > > things
> > > > > that stood a good chance of existing.
> > > > > 
> > > > Yes, that sounds very interesting. There are probably even more
> > > > such
> > > > kind of meta data which could be provided by the hashserver to
> > > > improve
> > > > the management of a shared sstate mirror.
> > > > 
> > > > Would it make sense to include e.g. the SRC_URI in the hashserv
> > > > database and extend the hashserver's API to also provide meta
> > > > data
> > > > e.g.
> > > > for the authorization of the sstate-mirror? Or is security and
> > > > authorization something which should be handled independently
> > > > from
> > > > hash
> > > > equivalence?
> > > 
> > > The more I've thought about this, any sort of query directly to a
> > > remote
> > > hashservice seems more and more problematic..  Local hash
> > > database,
> > > absolutely
> > > needed as an optimization.
> > > 
> > > There is a second problem.  In my org, for instance, it's easy for
> > > me to request an https server where I can serve files to the
> > > public.  But asking
> > > for
> > > our IT to
> > > support a hash equivalency (and pr) server?  This will likely
> > > take
> > > months of
> > > negotiation, possible security review, mitigation process, etc
> > > etc
> > > etc.. and no
> > > guarantee that it will actually get approved.  I expect other
> > > people
> > > will be in a
> > > similar situation.
> > 
> > Yes, this is often the case. Functions such as hash equivalence are
> > not
> > helpful in this respect. But they are also too great to be
> > considered
> > completely optional ;-).
> > 
> > With ongoing cloudification it's also more and more common that IT
> > departments support containerized applications and not only static
> > web
> > servers. But usually fully authenticated https is the only allowed
> > protocol.
> > 
> > It would be really nice to document or even develop some containers
> > and
> > terraforms or Kubernetes yaml files which are usable to build a
> > build
> > cluster, sstate mirror, pr server, sources mirror infrastructure.
> > It's
> > getting more and more complicated to do this at scale and with
> > security
> > in mind. Basic idea:
> > 
> > source-mirror---|   auth        / <-rw-> build-server(s)
> > sstate-storage--|     |         |
> >                  |----API -------|
> > sstate-db ------|     |         |
> > pr-server ------|     |         |
> >                        |         \ --ro-> build clients with SDK
> >                        |
> >                export to files
> >             (SDK, mirror, backup)
> 
> My experience is that getting a new service added (and as you said
> https with 
> authentication is often required) is still more difficult than using
> existing 
> file services such as artifactory or just a simple file store that
> already 
> exists.  (And IT departments are much more willing to service pure
> files, than
> files, then 
> instantiate a service that now has to be monitored for security and
> updated 
> regularly when there is no one "responsible" for this.  The cloud
> service model 
> has moved a lot of things to outside vendors as the ones responsible
> for these 
> actions and the YP components don't necessarily fit this model.)
> 
> Mind you for someone who CAN do this, it very well could be a more
> efficient 
> protocol, especially within an organization that may have different
> sites but 
> still remains inside an intranet.. but moving outside to the internet
> itself 
> adding services can be difficult.

Ideally there would be a generic solution for an sstate backend which
supports the following example use cases:

- One machine builds from sources without any external dependencies.
  That means the sstate backend is fully optional
- A setup like the Yocto A/B
  - Bare metal build machines, centralized setup
  - sstate is on a NFS share (file share)
  - meta data are contributed to a database (hash equivalence, pr)
    only from a handful of build machines
  - meta data + artifacts are provided read only for downloading
- A completely distributed setup
  - Everything is https, could be public Internet
  - Everything can be authenticated
  - Authorization plugins are supported
  - Bitbake can run in containers
    - build machines can be shared with non Yocto use cases
    - Containers can access http servers but not mount NFS
  - Theoretically a world wide build cluster is possible
    - There are build machines with read-only access (laptops with SDK)
    - Build machines with write access to sstate artifacts, metadata
    - But also every laptop could contribute if that makes sense
    - Binary reproducible builds and signed sstate artifacts are
      a perfect basis for that.

The completely distributed setup might look a bit fancy. But it is
what container registries and some container build infrastructures
fully support. A container registry is more than a web server and it's
not a file server. It's a special web service for good reasons.

It is possible to download a container image file and save it on a file
system, send it by e-mail or even share it via a CDN. And I think that
must also be possible with sstate artifacts. But I think if you really
want to make the sstate scalable and shareable, you can't get around a
special service anymore. Taking the concept from a file-based
implementation to an API-based implementation opens up so many
possibilities.
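
As a rough illustration of what "API-based" could mean, the following
Python sketch puts one small interface in front of the storage. All
names here (SstateBackend, DirectoryBackend, the writers set, the file
names) are made up for this example and do not exist in bitbake. The
directory variant covers a local or NFS-style setup; an https variant
would implement the same three methods against a web service, which is
also where authentication and authorization plugins could hook in.

```python
import tempfile
from abc import ABC, abstractmethod
from pathlib import Path


class SstateBackend(ABC):
    """Hypothetical interface; bitbake does not define such a class."""

    @abstractmethod
    def has(self, name):
        """Return True if the artifact `name` is available."""

    @abstractmethod
    def get(self, name):
        """Return the artifact content as bytes."""

    @abstractmethod
    def put(self, name, data, user):
        """Store an artifact on behalf of `user`."""


class DirectoryBackend(SstateBackend):
    """File-based variant: a plain directory (local disk or file share)."""

    def __init__(self, root, writers=()):
        self.root = Path(root)
        # Users allowed to upload; everybody else has read-only access.
        self.writers = set(writers)

    def has(self, name):
        return (self.root / name).is_file()

    def get(self, name):
        return (self.root / name).read_bytes()

    def put(self, name, data, user):
        if user not in self.writers:
            raise PermissionError(f"{user} has read-only access")
        self.root.mkdir(parents=True, exist_ok=True)
        (self.root / name).write_bytes(data)


with tempfile.TemporaryDirectory() as tmp:
    backend = DirectoryBackend(tmp, writers={"build-server"})
    backend.put("example.tar.zst", b"payload", user="build-server")
    print(backend.has("example.tar.zst"))  # prints True
    try:
        backend.put("other.tar.zst", b"x", user="laptop")
    except PermissionError as err:
        print(err)  # prints: laptop has read-only access
```

The point of the sketch is only that the build would talk to the
interface and not to a file system, so the same client could work
against a directory, an NFS share or a web service.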

Managing the sstate with a dedicated http based service is not only
interesting for sharing the sstate with users. It's also interesting to
have it like this as a replacement for e.g. an NFS server in a build
farm. The NFS server looks like a good solution for a bare metal build
farm which is located in one server rack. But for a build farm that
runs in the cloud or uses shared build machines from e.g. Github,
Gitlab or a similar setup within an organization, a web-based sstate
backend would be much better. It is usually not possible to integrate
an NFS server into such an infrastructure. It is also usually not
possible to mount an NFS share into a build job if the build job is
running in a container.

Another argument against an NFS server is cost. In the cloud, object
storage like S3 is much cheaper (and much easier to get) than a highly
available NFS server.

Best regards,
Adrian


> 
> > > 
> > > > Another topic where additional meta data about the sstate-cache
> > > > seems
> > > > to be beneficial is sstate-mirror retention. Knowing which
> > > > artifact
> > > > was
> > > > compiled for which tag or commit of the bitbake layer could
> > > > help to
> > > > wipe out some artifacts which are not needed anymore.
> > > > 
> > > > > >       - A script which gets a list of sstate artifacts from
> > > > > > bitbake
> > > > > > and then
> > > > > >         does an upfront download works much better
> > > > > >          + The script runs only when the user calls it or
> > > > > > the
> > > > > > SDK
> > > > > > gets boot-
> > > > > >            strapped
> > > > > >          + The script uses a reasonable amount of parallel
> > > > > > connections which
> > > > > >            are re-used for more than one artifact download
> > > > > 
> > > > > Explaining to users they need to do X before Y quickly gets
> > > > > tiring,
> > > > > both for people explaining it and the people doing it trying
> > > > > to
> > > > > remember. I'd really like to get to a point where the system
> > > > > "does
> > > > > the
> > > > > right thing" if we can.
> > > > > 
> > > > > I don't believe the problems you describe are insurmountable.
> > > > > If
> > > > > you
> > > > > are using sftp, that is going to be a big chunk of the
> > > > > problem as
> > > > > the
> > > > > system assumes something faster is available. Yes, I've taken
> > > > > patches
> > > > > to make sftp work but it isn't recommended at all. I
> > > > > appreciate
> > > > > there
> > > > > would be reasons why you use sftp but if it is possible to
> > > > > get a
> > > > > list
> > > > > of "available sstate" via other means, it would improve
> > > > > things.
> > > > > 
> > > > > >    * Idea for a smart lock/unlock implementation
> > > > > >       - From a user's perspective a locked vs. an unlocked
> > > > > > SDK
> > > > > > does
> > > > > > not make
> > > > > >         much sense. It makes more sense if the SDK would
> > > > > > automatically
> > > > > >         download the sstate-cache if it is expected to be
> > > > > > available.
> > > > > >         Lets think about an implementation (which allows to
> > > > > > override
> > > > > > the
> > > > > >         logic) to switch from automatic to manual mode:
> > > > > >         
> > > > > >         SSTATE_MIRRORS_ENABLED ?=
> > > > > > "${is_sstate_mirror_available()}"
> > > > > 
> > > > > What determines this availability? I worry that is something
> > > > > very
> > > > > fragile and specific to your use case. It is also not an all
> > > > > or
> > > > > nothing
> > > > > binary thing.
> > > > 
> > > > It would probably be better to query a hashserver if an artifact
> > > > is
> > > > present.
> > > > > 
> > > > > >         In our case the sstate mirror is expected to
> > > > > > provide all
> > > > > > artifacts
> > > > > >         for tagged commits and for some git branches of the
> > > > > > layer
> > > > > >         repositories.
> > > > > >         The sstate is obviously not usable for a "dirty"
> > > > > > git
> > > > > > layer
> > > > > >         repository.
> > > > > 
> > > > > That isn't correct and isn't going to work. If I make a
> > > > > single
> > > > > change
> > > > > locally, there is a good chance that 99.9% of the sstate
> > > > > could
> > > > > still
> > > > > be
> > > > > valid in some cases. Forcing the user through 10 hours of
> > > > > rebuild
> > > > > when
> > > > > potentially that much was available is a really really bad
> > > > > user
> > > > > experience.
> > > > 
> > > > Maybe there is a better idea.
> > > > 
> > > > > 
> > > > > >    That's what the is_sstate_mirror_available function
> > > > > >         could check to automatically enable and disable
> > > > > > lazy
> > > > > > downloads.
> > > > > >         
> > > > > >       - If is_sstate_mirror_available() returns false, it
> > > > > > should
> > > > > > still be
> > > > > >         possible to initiate a sstate-cache download
> > > > > > manually.
> > > > > >         
> > > > > >    * Terminology
> > > > > >       - Older Yocto Releases:
> > > > > >          + eSDK means an installer which provides a
> > > > > > different
> > > > > > environment with
> > > > > >            different tools
> > > > > >          + The eSDK was static, with a locked sstate cache
> > > > > >          + Was for one MACHINE, for one image...
> > > > > >       - Newer Yocto Releases:
> > > > > >          + The bitbake environment offers all features of
> > > > > > the
> > > > > > eSDK
> > > > > > installer. I
> > > > > >            consider this as already implemented with meta-
> > > > > > ide-
> > > > > > support
> > > > > > and
> > > > > >            build-sysroots.
> > > > > 
> > > > > Remember bblock and bbunlock too. These provide a way to fix
> > > > > or
> > > > > unlock
> > > > > specific sections of the codebase. Usually a developer has a
> > > > > pretty
> > > > > good idea of which bits they want to allow to change. I don't
> > > > > think
> > > > > people have yet realised/explored the potential these offer.
> > > > > 
> > > > 
> > > > Yes, I also started thinking about the possibilities we would
> > > > get
> > > > for
> > > > the SDK if there is a hash-server or an even more generic
> > > > metadata server for the sstate-cache in the middle of the
> > > > infrastructure picture. It would probably solve some challenges
> > > > for which I could not find a solution so far.
> > > 
> > > Using the standard download model/approach we already have a
> > > "generic" metadata
> > > server approach (and standard download URI supported by bitbake).
> > 
> > Not sure I understand which generic metadata approach you mean.
> 
> sstate-cache IS a generic metadata.  This is what I was referring
> to.  The 
> server is an https server that services files.
> 
> i.e.
>   
> https://downloads.yoctoproject.org/
>   
> https://sstate.yoctoproject.org/
> 
> 
> These and your custom entries can be added to SSTATE_MIRRORS and
> (ignoring PR 
> and hash equivalency) have "just worked".  Extending this would make
> it really 
> easy for end users to use the new functions without needing to do
> anything 
> additional.
> 
> The above also has advantage to the Yocto Project of CDN support.
> 
> --Mark
> 
> > Adrian
> > 
> > 
> > > The
> > > specialized approaches (prserver/hashserver) are where we run
> > > into
> > > issues
> > > because it's no longer "generic" and well understood by others. 
> > > Need
> > > to figure
> > > out a way for this all to work and allow the most "reasonable"
> > > re-use
> > > we can.
> > > 
> > > --Mark
> > > 
> > > > 
> > > > Thank you for your response.
> > > > 
> > > > Adrian
> > > > 
> > > > 
> > > > > Cheers,
> > > > > 
> > > > > Richard
> > > > 
> > > > 
> > > > 
> > > > 
> > > > 
> 
> 
>

