On 11/22/23 8:56 AM, [email protected] wrote:
While being able to support ssh (or a related protocol) is useful, you need to also remember that MANY MANY organizations absolutely block SSH access through their firewalls. So _requiring_ sftp would be bad. Allowing its usage would be good.

As for logging in, https is transport 'security' but not authentication without additional helpers. I think it's absolutely reasonable to say https access either needs an external helper for authentication purposes or it's un-authenticated.

Yes, there should be an option for authentication, maybe in front of the server, and a helper for doing authorization.

If you want that (internal to the company), then ssh/sftp or similar using the ssh-agent (or similar) should be the suggested approach.

We don't want to exclude anyone, but we want to be clear on the limitations based on an organization's specific choice.

You are right. https is what everybody (including us) really wants.

If one then wants to scale an sstate-cache server for many different projects and users, one quickly wishes for an option for authorization at artifact level. Ideally, the access rights to the source code would be completely transferred to the associated sstate artifacts. For such an authorization the sstate mirror server would require the SRC_URI which was used to compile the sstate artifact. With this information, it could ask the Git server whether or not a user has access to all source code repositories, and grant or deny access to a particular sstate artifact accordingly. It should not be forgotten that the access rights to the Git repositories can change.
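
Roughly sketched in Python, with the mirror-side metadata lookup and the Git forge API being made-up names purely to illustrate the idea:

def can_access_artifact(user, artifact_hash, metadata_db, forge):
    """Grant access to an sstate artifact only if the user can read every
    repository listed in the SRC_URI the artifact was built from."""
    # metadata_db maps an artifact hash to the SRC_URI entries recorded at upload time
    src_uris = metadata_db.get_src_uris(artifact_hash)
    if src_uris is None:
        return False  # no metadata recorded: deny by default
    for uri in src_uris:
        if uri.startswith("git://") or uri.startswith("gitsm://"):
            repo = uri.split(";")[0]
            # Re-check on every request, because repository permissions can change
            if not forge.user_can_read(user, repo):
                return False
    return True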

In my experience you do not use _one_ sstate-cache for multiple projects (at an organization level); each project is responsible for its own cache. This prevents even the possibility that one project could use code not intended for it.

But the sstate cache is already perfectly suited for deduplicated
archiving. The idea of storing all artifacts on a file server and
storing additional metadata that enables authorization, storage and
backup is therefore obvious when it comes to scaling.


From a more generic Yocto Project perspective, this means you really want to use a hierarchy of sstate-caches. (Maybe not a true hierarchy.) I.e. I use YP, so I get the YP sstate-cache for the base functionality. I use meta-openembedded, so I want the meta-openembedded cache... For project A, I want the project's cache, plus the OE and YP caches as well... For project B, I want that project's cache, plus the OE and YP caches. Project C? I might want its cache, Project A's, Project B's, and OE and YP. You can see this gets complicated quickly.

If this either isn't intended or isn't a good idea, then alternatives need to be provided for this. Everyone always ends up with an upstream provider (or providers), be it YP, OE, OSVs, ISVs, local company resources, etc. How do we manage this and keep it aligned?

I tried to look at it in a hierarchical way. But the longer I looked into it, the less sense it made. Hierarchy is not the most natural property an sstate cache really has. The most natural property is the content-addressable file names. I see more similarities with e.g. a git tree or a container registry than with a hierarchical system. There are content-addressable objects and a database with meta information for organizing these objects.

I agree.. but where I think hierarchy comes in is trust. I trust YP to be the first place to check, I trust my OSV, my ISV, etc. The reality though is that git may be a better way to think of it. It's a collection of hashes from multiple sources. The only priority here is the order in which we search, and if you include 'untrusted' sources it's at your own risk (of course this is when sstate-cache signing becomes important...).


Bring in hash equivalency and PR service and things get complicated. The sstate-cache itself is NOT separable from those services. There are ways to decouple them, but they can be 'extreme'. I.e. turn off hash-equivalency, no need for a hash-equivalency service. Don't cache the do_package_write* files, no PR service... (but even that isn't foolproof due to git AUTOINC... so you end up seeding the AUTOINC with static entries or some other method...)

All of these items need to be dealt with and documented together. My PERSONAL preference (without knowing any specific implementation details) is that the contents of the hash-equivalency and PR service are somehow stored with the sstate-cache.

That's basically the database which is needed to organize the sstate-cache. If bitbake added more information to this database when uploading the sstate-cache artifacts, the database could be used for more use cases such as authorization or data retention. One example could be the SRC_URI. It would e.g. help with authorization at artifact level (the user must have access to all repositories referred to by the SRC_URI to get access to the sstate artifact as well), or with the organization of an internal source mirror (each source artifact compiled into an sstate artifact should be on the mirror).
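
As a rough illustration of the kind of record bitbake could upload alongside an artifact (the field names and the way layer revisions are collected are pure assumptions on my side):

import json

def build_artifact_metadata(artifact_hash, d, layer_revisions):
    """Hypothetical record an upload step could store next to an sstate
    artifact. 'd' stands for the bitbake datastore of the task being
    packaged; layer_revisions would have to be collected separately
    (e.g. one git revision per layer)."""
    record = {
        "hash": artifact_hash,
        "pn": d.getVar("PN"),
        "src_uri": (d.getVar("SRC_URI") or "").split(),
        # would allow retention policies per tag/branch, see below
        "layer_revisions": layer_revisions,
    }
    return json.dumps(record)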

As an optimization the local system needs a database (PR service and hash equivalency). But data storage and retrieval are better in a "filesystem"-like storage. So how do we move from one to the other? (sstate-cache is already using a filesystem-like storage to do many things, so this becomes an extension.)

This allows people to then serve/retrieve the sstate-cache via https from a web server, a local Artifactory, etc. Along with the sstate-cache they automatically get the matching hash-equivalency and PR service data, which gets injected into their local database.

So you get into a situation where you calculate the hash of what you need, and retrieve it. The retrieval could be (the equivalent of) a link that has pointer information to the 'real' object; the link gets injected into the local hash equivalency server and then the 'real' object is downloaded (repeat as needed). If the object requested uses 'PR' data, then it should include the appropriate data, which can be observed and used to update the local PR server.. i.e.

Calculate that do_package_write* is hash abcde, we download abcde, inspect it for PR = r0.1, then write PN, PV, PE, PR (hash abcde) into the PR database and we've got a new matching entry, and possibly an updated "newest AUTOINC" version. (Obviously it's not THAT easy, but that's the idea and it keeps it all central on the sstate-cache.) The key question is: when the sstate-cache is written out, do these 'links' and PR service data get included (always) or do we need an export step to add them? (I'd prefer the former, but I've no idea what it does to build performance!)
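
Very roughly, and only as an illustration (how the PN/PE/PV/PR values are actually extracted from the downloaded object is left open here):

def seed_local_pr(pkg_meta, pr_db):
    """pkg_meta: a dict with the PN/PE/PV/PR values read out of a downloaded
    do_package_write* object (however that extraction is done in practice).
    pr_db: a plain dict standing in for the local PR service database."""
    key = (pkg_meta["PN"], pkg_meta["PE"], pkg_meta["PV"])
    # Record the revision carried by the downloaded object so the next
    # local build reuses it instead of allocating a conflicting one.
    pr_db.setdefault(key, set()).add(pkg_meta["PR"])   # e.g. "r0.1"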


One possible way this could be done: the system starts up, determines it needs something it doesn't have, then goes out and checks if an updated index is present. If it is, it downloads it and adds it to its hash equivalency server. If no index is present, it can then look for a file, let's say "sstate:....link". If that comes back, we know we have an equivalency; it's downloaded, added to the local database, and then the pointed-to file is retrieved (.siginfo and .tar.xz or whatever). This would ensure that the index is an optimization, but not a requirement, and would allow a "live" sstate-cache while losing some performance. (This doesn't negate any of the comments about access rights or the possibility of DoSing a server via too many connections!)
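
As a sketch of that fallback lookup (the index handling is omitted; the "sstate:....link" naming and the mirror layout are purely illustrative, not existing bitbake behaviour):

import urllib.error
import urllib.request

def mirror_has(url, timeout=10):
    """HEAD request to check whether the mirror serves a given file."""
    req = urllib.request.Request(url, method="HEAD")
    try:
        with urllib.request.urlopen(req, timeout=timeout):
            return True
    except urllib.error.URLError:
        return False

def resolve_taskhash(mirror, taskhash):
    """Only consulted when no (or an outdated) index is available: the link
    file names the equivalent 'real' object, which the caller then records
    locally and downloads (.siginfo plus archive)."""
    link_url = f"{mirror}/sstate:{taskhash}.link"   # naming is illustrative only
    if mirror_has(link_url):
        with urllib.request.urlopen(link_url) as f:
            return f.read().decode().strip()
    return None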

Not sure I got this idea. But due to the fact that the sstate is content-addressable, clients can calculate the index of the sstate-cache locally based on the layer data. If a client does something like:

bitbake core-image-minimal --setscene-only -DD | grep "^DEBUG: SState: Looked for but didn't find file "

bitbake prints all the missing sstate artifacts.

With hash equivalence the missing artifacts have to be mapped to the
available artifacts as an additional step which then probably makes a
central server mandatory.
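
A rough sketch of an upfront-download script driven by that output (the mirror URL, the exact debug-message format and the sstate-cache path layout are assumptions here):

import concurrent.futures
import os
import re
import subprocess
import urllib.request

MIRROR = "https://sstate.example.com"   # placeholder mirror base URL

def missing_sstate_files(image="core-image-minimal"):
    """Run the command quoted above and collect the paths bitbake looked for."""
    out = subprocess.run(["bitbake", image, "--setscene-only", "-DD"],
                         capture_output=True, text=True, check=False).stdout
    pattern = re.compile(r"^DEBUG: SState: Looked for but didn't find file (.+)$",
                         re.MULTILINE)
    return pattern.findall(out)

def prefetch(paths, jobs=8):
    """Download the missing artifacts with a bounded number of parallel workers."""
    def fetch(path):
        name = path.rsplit("/sstate-cache/", 1)[-1]   # assumed local layout
        os.makedirs(os.path.dirname(path), exist_ok=True)
        urllib.request.urlretrieve(f"{MIRROR}/{name}", path)
    with concurrent.futures.ThreadPoolExecutor(max_workers=jobs) as pool:
        list(pool.map(fetch, paths))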


Still have to solve the PR service problem, but this could get 'seeded' via the associated do_package_write* siginfo or similar... and for the AUTOINC, seed it from the siginfo file for a given hash?

Doing something like the above could then allow the order specified in SSTATE_MIRRORS to be used to truly indicate the order things are resolved and loaded.

Recently we've been wondering about teaching the hashequiv server about "presence", which would then mean the build would only query things that stood a good chance of existing.

Yes, that sounds very interesting. There are probably even more such kinds of metadata which could be provided by the hashserver to improve the management of a shared sstate mirror.

Would it make sense to include e.g. the SRC_URI in the hashserv database and extend the hashserver's API to also provide metadata, e.g. for the authorization of the sstate-mirror? Or are security and authorization something which should be handled independently from hash equivalence?

The more I've thought about this, the more any sort of query directly to a remote hashservice seems problematic. A local hash database is absolutely needed as an optimization.

There is a second problem. In my org, for instance, it's easy for me to request an https server where I can serve files to the public. But asking our IT to support a hash equivalency (and PR) server? This will likely take months of negotiation, possible security review, mitigation process, etc. etc... and no guarantee that it will actually get approved. I expect other people will be in a similar situation.

Yes, this is often the case. Functions such as hash equivalence are not
helpful in this respect. But they are also too great to be considered
completely optional ;-).

With ongoing cloudification it's also more and more common that IT
departments support containerized applications and not only static web
servers. But usually fully authenticated https is the only allowed
protocol.

It would be really nice to document or even develop some containers and Terraform or Kubernetes YAML files which are usable to build a build cluster, sstate mirror, PR server, and source mirror infrastructure. It's getting more and more complicated to do this at scale and with security in mind. Basic idea:

source-mirror---|    auth      / <-rw-> build-server(s)
sstate-storage--|     |       |
                |----API------|
sstate-db ------|     |       |
pr-server ------|     |       |
                      |        \ --ro-> build clients with SDK
                      |
               export to files
             (SDK, mirror, backup)

My experience is that getting a new service added (and as you said, https with authentication is often required) is still more difficult than using existing file services such as Artifactory or just a simple file store that already exists. (And IT departments are much more willing to serve pure files than to instantiate a service that now has to be monitored for security and updated regularly when there is no one "responsible" for this. The cloud service model has moved a lot of things to outside vendors as the ones responsible for these actions, and the YP components don't necessarily fit this model.)

Mind you, for someone who CAN do this, it very well could be a more efficient protocol, especially within an organization that may have different sites but still remains inside an intranet... but moving outside to the internet itself, adding services can be difficult.


Another topic where additional metadata about the sstate-cache seems to be beneficial is sstate-mirror retention. Knowing which artifact was compiled for which tag or commit of the bitbake layers could help to wipe out some artifacts which are not needed anymore.
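
A retention pass over such metadata could then be as simple as this sketch (assuming the per-artifact record outlined above; the database and store objects are stand-ins):

def prune_sstate(metadata_db, keep_revisions, storage):
    """metadata_db: artifact hash -> metadata record (as sketched earlier).
    keep_revisions: set of layer tags/commits we still want to serve.
    storage: abstract file store holding the actual artifacts."""
    for artifact_hash, record in metadata_db.items():
        built_for = set(record.get("layer_revisions", []))
        # Keep anything we cannot attribute; only delete artifacts whose
        # recorded revisions have all dropped out of the keep set.
        if built_for and not built_for & set(keep_revisions):
            storage.delete(artifact_hash)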

      - A script which gets a list of sstate artifacts from bitbake and then
        does an upfront download works much better
         + The script runs only when the user calls it or the SDK gets
           bootstrapped
         + The script uses a reasonable number of parallel connections which
           are re-used for more than one artifact download

Explaining to users they need to do X before Y quickly gets tiring, both for people explaining it and the people doing it trying to remember. I'd really like to get to a point where the system "does the right thing" if we can.

I don't believe the problems you describe are insurmountable. If you are using sftp, that is going to be a big chunk of the problem as the system assumes something faster is available. Yes, I've taken patches to make sftp work but it isn't recommended at all. I appreciate there would be reasons why you use sftp but if it is possible to get a list of "available sstate" via other means, it would improve things.

   * Idea for a smart lock/unlock implementation
      - From a user's perspective a locked vs. an unlocked SDK does not make
        much sense. It makes more sense if the SDK would automatically
        download the sstate-cache if it is expected to be available.
        Let's think about an implementation (which allows overriding the
        logic) to switch from automatic to manual mode:

        SSTATE_MIRRORS_ENABLED ?= "${@is_sstate_mirror_available()}"

What determines this availability? I worry that is something very fragile and specific to your use case. It is also not an all-or-nothing binary thing.

It would probably be better to query a hashserver to check whether an artifact is present.
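
For illustration, a very naive version of such an is_sstate_mirror_available() check could just probe the mirror over https (the URL is a placeholder, and it returns strings so it could be used in a variable expansion):

import urllib.error
import urllib.request

def is_sstate_mirror_available(url="https://sstate.example.com/", timeout=5):
    """Cheap reachability probe; it only answers 'is the server there', not
    'does it have the artifacts I need', which is why a per-artifact query
    (e.g. against a hash server) would be the better check."""
    try:
        req = urllib.request.Request(url, method="HEAD")
        with urllib.request.urlopen(req, timeout=timeout):
            return "1"
    except (urllib.error.URLError, OSError):
        return "0"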

        In our case the sstate mirror is expected to provide all artifacts
        for tagged commits and for some git branches of the layer
        repositories.
        The sstate is obviously not usable for a "dirty" git layer
        repository.

That isn't correct and isn't going to work. If I make a single change locally, there is a good chance that 99.9% of the sstate could still be valid in some cases. Forcing the user through 10 hours of rebuild when potentially that much was available is a really really bad user experience.

Maybe there is a better idea.


        That's what the is_sstate_mirror_available function could check to
        automatically enable and disable lazy downloads.
      - If is_sstate_mirror_available() returns false, it should still be
        possible to initiate an sstate-cache download manually.
   * Terminology
      - Older Yocto Releases:
         + eSDK means an installer which provides a different environment
           with different tools
         + The eSDK was static, with a locked sstate cache
         + Was for one MACHINE, for one image...
      - Newer Yocto Releases:
         + The bitbake environment offers all features of the eSDK
           installer. I consider this already implemented with
           meta-ide-support and build-sysroots.

Remember bblock and bbunlock too. These provide a way to fix or unlock specific sections of the codebase. Usually a developer has a pretty good idea of which bits they want to allow to change. I don't think people have yet realised/explored the potential these offer.


Yes, I also started thinking about the possibilities we would get for the SDK if there is a hash server, or an even more generic metadata server for the sstate-cache, in the middle of the infrastructure picture. It would probably solve some challenges for which I could not find a solution so far.

Using the standard download model/approach we already have a "generic" metadata server approach (and a standard download URI supported by bitbake).

Not sure I understand which generic metadata approach you mean.

The sstate-cache IS generic metadata. This is what I was referring to. The server is an https server that serves files.

i.e.
  https://downloads.yoctoproject.org
  https://sstate.yoctoproject.org


These and your custom entries can be added to SSTATE_MIRRORS and (ignoring PR and hash equivalency) have "just worked". Extending this would make it really easy for end users to use the new functions without needing to do anything additional.
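
For reference, the mechanism that already "just works" today is the documented SSTATE_MIRRORS entry format; a server such as the sstate.yoctoproject.org one above plugs into it the same way (the server path below is only the generic documented example, not a real layout):

SSTATE_MIRRORS ?= "file://.* https://someserver.tld/share/sstate/PATH;downloadfilename=PATH"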

The above also has the advantage to the Yocto Project of CDN support.

--Mark

Adrian


The specialized approaches (prserver/hashserver) are where we run into issues, because they're no longer "generic" and well understood by others. We need to figure out a way for this all to work and allow the most "reasonable" re-use we can.

--Mark


Thank you for your response.

Adrian


Cheers,

Richard




