> While being able to support ssh (or a related protocol) is useful, you
> need to also remember that MANY MANY organizations absolutely block SSH
> access through their firewalls.  So _requiring_ sftp would be bad.
> Allowing its usage would be good.
> 
> As for logging in, https is transport 'security' but not authentication
> without additional helpers.  I think it's absolutely reasonable to say
> https access either needs an external helper for authentication
> purposes or it's un-authenticated.
Yes, there should be an option for authentication, perhaps in front of
the server, plus a helper for doing the authorization.
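
To make that concrete, here is a minimal sketch of what such a front end
could look like. Everything in it (paths, the token set, the helper) is
made up for illustration; a real setup would delegate the token check to
the organization's identity provider and the per-artifact decision to a
separate authorization helper:

import http.server
import os

SSTATE_DIR = "/srv/sstate-cache"       # assumed artifact location
VALID_TOKENS = {"example-token"}       # stand-in for a real identity provider

def is_authorized(token, path):
    # Hypothetical authorization helper: a real one would ask an external
    # service whether this identity may read this particular artifact.
    return token in VALID_TOKENS

class SstateHandler(http.server.SimpleHTTPRequestHandler):
    def do_GET(self):
        auth = self.headers.get("Authorization", "")
        token = auth[len("Bearer "):] if auth.startswith("Bearer ") else ""
        if not is_authorized(token, self.path):
            self.send_error(401, "authentication required")
            return
        super().do_GET()

if __name__ == "__main__":
    os.chdir(SSTATE_DIR)               # serve the sstate artifacts from here
    http.server.HTTPServer(("", 8443), SstateHandler).serve_forever()

The point is only that authentication sits in front of a plain file
store while the authorization decision lives in a separate helper.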

>  If you want (internal to the company) then ssh/sftp or similar using
> the ssh-agent (or similar) should be the suggested approach.
> 
> We don't want to exclude anyone, but we want to be clear on the
> limitations based on an organization's specific choice.
> 
You are right. https is what everybody (including us) really wants.

> > If one then wants to scale an sstate-cache server for many different
> > projects and users, one quickly wishes for an option for authorization
> > at artifact level. Ideally, the access rights to the source code would
> > be completely transferred to the associated sstate artifacts. For such
> > an authorization the sstate mirror server would require the SRC_URI
> > which was used to compile the sstate artifact. With this information,
> > it could ask the Git server whether or not a user has access to all
> > source code repositories to grant or deny access to a particular
> > sstate artifact. It should not be forgotten that the access rights to
> > the Git repositories can change.
> 
> In my experience you do not use _one_ sstate-cache for multiple
> projects (at an organization level); each project is responsible for
> its own cache.  This prevents even the possibility that one project
> could use code not intended for it.

But the sstate cache is already perfectly suited for deduplicated
archiving. The idea of storing all artifacts on a file server, together
with additional metadata that enables authorization, storage and
backup, is therefore an obvious one when it comes to scaling.

> 
>  From a more generic Yocto Project perspective, this means you really
> want to use a hierarchy of sstate-caches (maybe not a true hierarchy).
> I.e. I use YP, so I get the YP sstate-cache for the base functionality.
> I use meta-openembedded, so I want the meta-openembedded cache...
> project A, I want the project's cache, OE and YP caches as well...
> project B, I want that project's cache, OE and YP caches.  Project C?
> I might want its cache, Project A, Project B, and OE and YP.  You can
> see this gets complicated quickly.
> 
> If this either isn't intended or isn't a good idea, then alternatives
> need to be provided for this.  Everyone always ends up with an upstream
> provider (or providers), be it YP, OE, OSVs, ISVs, local company
> resources, etc.  How do we manage this and keep it aligned?

I tried to look at it in a hierarchical way, but the longer I looked
into it, the less sense it made. A hierarchy is not the most natural
structure of an sstate cache; its most natural structure is the
content-addressable file names. I see more similarities with e.g. a git
tree or a container registry than with a hierarchical system: there are
content-addressable objects and a database with meta information for
organizing these objects.
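
A small sketch of that "objects plus metadata database" view, with a
git-style two-level fan-out. The layout and schema are purely
illustrative, not bitbake's actual naming:

import hashlib
import sqlite3
from pathlib import Path

STORE = Path("/srv/sstate-objects")    # assumed object store location

def object_path(digest):
    # Two-level fan-out by hash prefix, like git loose objects or OCI blobs.
    return STORE / digest[:2] / digest[2:]

def add_artifact(db, data, src_uri, layer_rev):
    digest = hashlib.sha256(data).hexdigest()
    path = object_path(digest)
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_bytes(data)
    # All organizing information lives in the database, not in file names.
    db.execute("INSERT OR REPLACE INTO artifacts VALUES (?, ?, ?)",
               (digest, src_uri, layer_rev))
    db.commit()
    return digest

db = sqlite3.connect("sstate-meta.db")
db.execute("CREATE TABLE IF NOT EXISTS artifacts "
           "(digest TEXT PRIMARY KEY, src_uri TEXT, layer_rev TEXT)")

Retention, authorization and backup then become queries against the
database rather than walks over the directory tree.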

> 
> Bring in hash equivalency and PR service and things get complicated.
> The sstate-cache itself is NOT separable from those services.  There
> are ways to decouple them, but they can be 'extreme'.  I.e. turn off
> hash-equivalency, no need for a hash-equivalency service.  Don't cache
> the do_package_write* files, no PR service...  (but even that isn't
> foolproof due to git AUTOINC... so you end up seeding the AUTOINC with
> static entries or some other method...)
> 
> All of these items need to be dealt with and documented together.  My
> PERSONAL preference (without knowing any specific implementation
> details) is that the contents of the hash-equivalency and PR service
> are somehow stored with the sstate-cache.

That's basically the database which is needed to organize the
sstate-cache. If bitbake added more information to this database when
uploading the sstate artifacts, the database could be used for more use
cases such as authorization or data retention. One example could be the
SRC_URI: it would help with authorization at artifact level (the user
must have access to all repositories referred to by the SRC_URI in
order to get access to the sstate artifact) as well as with the
organization of an internal source mirror (each source artifact
compiled into an sstate artifact should be on the mirror).
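
Roughly like this, reusing the artifacts table from the sketch above.
The permission query against the git server is hypothetical, and a real
implementation would also have to handle non-git fetchers and the fact
that repository permissions change over time:

import sqlite3

def repos_from_src_uri(src_uri):
    # Keep only git fetcher entries; other fetchers would need their own rules.
    return [entry.split(";")[0] for entry in src_uri.split()
            if entry.startswith("git://")]

def user_can_read_repo(user, repo):
    # Placeholder for a query against the git server's permission API
    # (e.g. project membership on GitLab/Gitea); denies by default here.
    return False

def may_download(db, user, digest):
    row = db.execute("SELECT src_uri FROM artifacts WHERE digest = ?",
                     (digest,)).fetchone()
    if row is None:
        return False
    return all(user_can_read_repo(user, repo)
               for repo in repos_from_src_uri(row[0]))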

> 
> One possible way this could be done...  System starts up, determines
> it needs something it doesn't have, then goes out and checks if an
> updated index is present.  If it is, it downloads it and adds it to
> its hash equivalency server.  If no index is present, it can then look
> for the file, let's say "sstate:....link".  If that comes back, we
> know we have an equivalency; it's downloaded, added to the local
> database, and then the pointed-to file is retrieved (.siginfo and
> .tar.xz or whatever).  This would ensure that the index is an
> optimization, but not a requirement, and would allow a "live"
> sstate-cache while losing some performance.  (This doesn't negate any
> of the comments about rights or the possibility to DoS a server via
> too many connections!)

I'm not sure I got this idea. But because the sstate is content
addressable, clients can calculate the index of the sstate-cache
locally based on the layer data. If a client runs something like:

bitbake core-image-minimal --setscene-only -DD | grep "^DEBUG: SState: Looked for but didn't find file "

bitbake prints all the missing sstate artifacts.
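
An upfront-download script could be built directly on that output. The
mirror URL, the environment variable and the 1:1 path mapping below are
assumptions rather than a fixed interface, but the shape would be:

import concurrent.futures
import os
import subprocess
import urllib.request

MIRROR = "https://sstate.example.com/"                      # assumed mirror base URL
SSTATE_DIR = os.environ.get("SSTATE_DIR", "sstate-cache")   # assumed local cache dir

PREFIX = "DEBUG: SState: Looked for but didn't find file "

def missing_artifacts(target="core-image-minimal"):
    # Run the command above and keep only the reported file names.
    out = subprocess.run(["bitbake", target, "--setscene-only", "-DD"],
                         capture_output=True, text=True).stdout
    return [line[len(PREFIX):].strip()
            for line in out.splitlines() if line.startswith(PREFIX)]

def fetch(local_path):
    # Assumption: the mirror exposes the same layout as the local SSTATE_DIR.
    rel = os.path.relpath(local_path, SSTATE_DIR)
    os.makedirs(os.path.dirname(local_path), exist_ok=True)
    urllib.request.urlretrieve(MIRROR + rel, local_path)

if __name__ == "__main__":
    # A bounded worker pool keeps the number of parallel connections sane.
    with concurrent.futures.ThreadPoolExecutor(max_workers=8) as pool:
        list(pool.map(fetch, missing_artifacts()))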

With hash equivalence, the missing artifacts have to be mapped to the
available artifacts in an additional step, which then probably makes a
central server mandatory.
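
With bitbake's hashserv client that extra step would look roughly like
this; the server address is only an example and the method string has
to match SSTATE_HASHEQUIV_METHOD:

import hashserv    # bitbake's hash equivalence client library

HASHSERVE = "hashserv.example.com:8686"     # example server address
METHOD = "oe.sstatesig.OEOuthashBasic"      # OE-Core default, may differ

def mirror_hash(taskhash):
    # Ask the hash equivalence server which unihash this taskhash maps
    # to; the artifact on the mirror is named after that unihash.
    client = hashserv.create_client(HASHSERVE)
    try:
        unihash = client.get_unihash(METHOD, taskhash)
    finally:
        client.close()
    return unihash or taskhash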

> 
> Still have to solve the PR service problem, but this could get 'seeded'
> via the associated do_write_package siginfo or similar...  and for the
> AUTOINC, seed it from the siginfo file for a given hash?
> 
> Doing something like the above could then allow the order specified in
> SSTATE_MIRRORS to be used to truly indicate the order things are
> resolved and loaded.
> > > Recently we've been wondering about teaching the hashequiv server
> > > about "presence", which would then mean the build would only query
> > > things that stood a good chance of existing.
> > > 
> > Yes, that sounds very interesting. There are probably even more such
> > kinds of metadata which could be provided by the hashserver to
> > improve the management of a shared sstate mirror.
> > 
> > Would it make sense to include e.g. the SRC_URI in the hashserv
> > database and extend the hashserver's API to also provide metadata,
> > e.g. for the authorization of the sstate-mirror?  Or are security and
> > authorization something which should be handled independently of
> > hash equivalence?
> 
> The more I've thought about this, any sort of query directly to a
> remote hash service seems more and more problematic...  A local hash
> database is absolutely needed as an optimization.
> 
> There is a second problem.  In my org, for instance, it's easy for me
> to request an https server where I can serve files to the public.  But
> asking our IT to support a hash equivalency (and PR) server?  This
> will likely take months of negotiation, possible security review,
> mitigation process, etc. etc. ... and no guarantee that it will
> actually get approved.  I expect other people will be in a similar
> situation.

Yes, this is often the case. Functions such as hash equivalence are not
helpful in this respect, but they are also too valuable to be
considered completely optional ;-).

With ongoing cloudification it's also more and more common that IT
departments support containerized applications and not only static web
servers. But usually fully authenticated https is the only allowed
protocol.

It would be really nice to document or even develop containers and
Terraform or Kubernetes YAML files which can be used to build a
build-cluster, sstate-mirror, PR-server and source-mirror
infrastructure. It's getting more and more complicated to do this at
scale and with security in mind. Basic idea:

source-mirror---|   auth        / <-rw-> build-server(s)
sstate-storage--|     |         |
                |----API -------|
sstate-db ------|     |         |
pr-server ------|     |         |
                      |         \ --ro-> build clients with SDK
                      |
              export to files
           (SDK, mirror, backup)

> 
> > Another topic where additional metadata about the sstate-cache seems
> > to be beneficial is sstate-mirror retention.  Knowing which artifact
> > was compiled for which tag or commit of the bitbake layer could help
> > to wipe out some artifacts which are not needed anymore.
> > 
> > > >      - A script which gets a list of sstate artifacts from
> > > >        bitbake and then does an upfront download works much better
> > > >         + The script runs only when the user calls it or the SDK
> > > >           gets bootstrapped
> > > >         + The script uses a reasonable number of parallel
> > > >           connections which are re-used for more than one
> > > >           artifact download
> > > 
> > > Explaining to users they need to do X before Y quickly gets
> > > tiring, both for people explaining it and the people doing it
> > > trying to remember. I'd really like to get to a point where the
> > > system "does the right thing" if we can.
> > > 
> > > I don't believe the problems you describe are insurmountable. If
> > > you are using sftp, that is going to be a big chunk of the problem
> > > as the system assumes something faster is available. Yes, I've
> > > taken patches to make sftp work but it isn't recommended at all. I
> > > appreciate there would be reasons why you use sftp but if it is
> > > possible to get a list of "available sstate" via other means, it
> > > would improve things.
> > > 
> > > >   * Idea for a smart lock/unlock implementation
> > > >      - From a user's perspective a locked vs. an unlocked SDK
> > > >        does not make much sense.  It makes more sense if the SDK
> > > >        would automatically download the sstate-cache if it is
> > > >        expected to be available.  Let's think about an
> > > >        implementation (which allows overriding the logic) to
> > > >        switch from automatic to manual mode:
> > > > 
> > > >        SSTATE_MIRRORS_ENABLED ?= "${is_sstate_mirror_available()}"
> > > 
> > > What determines this availability? I worry that is something very
> > > fragile and specific to your use case. It is also not an all or
> > > nothing binary thing.
> > 
> > It would probably be better to query a hashserver if an artifact is
> > present.
> > > 
> > > >        In our case the sstate mirror is expected to provide all
> > > >        artifacts for tagged commits and for some git branches of
> > > >        the layer repositories.
> > > >        The sstate is obviously not usable for a "dirty" git
> > > >        layer repository.
> > > 
> > > That isn't correct and isn't going to work. If I make a single
> > > change locally, there is a good chance that 99.9% of the sstate
> > > could still be valid in some cases. Forcing the user through 10
> > > hours of rebuild when potentially that much was available is a
> > > really really bad user experience.
> > 
> > Maybe there is a better idea.
> > 
> > > 
> > > >        That's what the is_sstate_mirror_available function could
> > > >        check to automatically enable and disable lazy downloads.
> > > > 
> > > >      - If is_sstate_mirror_available() returns false, it should
> > > >        still be possible to initiate an sstate-cache download
> > > >        manually.
> > > > 
> > > >   * Terminology
> > > >      - Older Yocto Releases:
> > > >         + eSDK means an installer which provides a different
> > > >           environment with different tools
> > > >         + The eSDK was static, with a locked sstate cache
> > > >         + Was for one MACHINE, for one image...
> > > >      - Newer Yocto Releases:
> > > >         + The bitbake environment offers all features of the eSDK
> > > >           installer.  I consider this as already implemented with
> > > >           meta-ide-support and build-sysroots.
> > > 
> > > Remember bblock and bbunlock too. These provide a way to fix or
> > > unlock specific sections of the codebase. Usually a developer has
> > > a pretty good idea of which bits they want to allow to change. I
> > > don't think people have yet realised/explored the potential these
> > > offer.
> > > 
> > 
> > Yes, I also started thinking about the possibilities we would get
> > for the SDK if there is a hash server or, even more generic, a
> > metadata server for the sstate-cache in the middle of the
> > infrastructure picture.  It would probably solve some challenges
> > which I could not find a solution for so far.
> 
> Using the standard download model/approach we already have a "generic"
> metadata server approach (and standard download URI supported by
> bitbake).

Not sure I understand which generic metadata approach you mean.

Adrian


> The specialized approaches (prserver/hashserver) are where we run into
> issues because they're no longer "generic" and well understood by
> others.  Need to figure out a way for this all to work and allow the
> most "reasonable" re-use we can.
> 
> --Mark
> 
> > 
> > Thank you for your response.
> > 
> > Adrian
> > 
> > 
> > > Cheers,
> > > 
> > > Richard
