Hi all,
I gave this some thought last week, and put up this pull request,
mostly as a draft / RFC:
https://github.com/apache/buildstream/pull/1997
Copied from the PR:
Some thoughts about this approach:
* This prioritizes machine readability and standardization of source
provenance and version information
Sources have a lot of freedom in how they implement things, and so we
may very well need to expand on the types and constants added here,
such as `SourceInfoMedium`, `SourceVersionType`, etc.
The idea here is to have greater certainty about how sources are
obtained, even if this cannot be covered by all currently existing
source implementations (e.g. I didn't initially add a `bzr` medium
for which we have a plugin, or a `cvs` medium for which we do not yet
have a plugin).
Aspirationally, forcing this data to be precise can allow adjacent
tooling to do useful things.
* This drops the freeform "public data" mentioned in the proposal
discussion
My rationale for this choice in this branch is that ultimately we
want data with a constant shape. If, for example, we want the user
to be able to override or assist a source in determining the
reported *"version"*, then the `Source` implementation already has
everything it needs to do so:
* it can add configuration keys for the user to set
* it has the power to implement `collect_source_info()` however it
wants.
* This does not yet attempt to cover the concept of *tracking*
information
I would like to consider this, but we should think carefully about
how this can be useful. For instance, some git plugins have different
interpretations of what their "tracking" strings mean, sometimes
following a branch head, sometimes looking for the latest tag in
history which matches a given regular expression.
If we export this tracking information, external tooling should be
able to use it to perform the tracking itself and come to the same
conclusion; otherwise it is unclear what exporting it is useful for.
* This does not cover the CVE information
While the SourceInfo objects representing a source's provenance are
returned as a list, I believe that the CVE information continues to
be a per-element concept.
For example, when we have applied security patches to a module, those
security patches are themselves sources, whose provenance is that
they are revisioned in the local project.
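
To make the "constant shape" point above a bit more concrete, here is a
minimal, standalone sketch. It is NOT the code from the PR: only the
names `SourceInfo`, `SourceInfoMedium`, `SourceVersionType` and
`collect_source_info()` come from the discussion; the enum members, the
fields, the `ExampleGitSource` class and its `version_override` key are
all assumptions made up for illustration.

```python
import re
from dataclasses import dataclass
from enum import Enum
from typing import List, Optional


class SourceInfoMedium(Enum):
    # Hypothetical members; the real set lives in the PR and may differ.
    GIT = "git"
    REMOTE_FILE = "remote-file"
    LOCAL = "local"


class SourceVersionType(Enum):
    # Hypothetical members, for illustration only.
    COMMIT = "commit"
    SHA256 = "sha256"


@dataclass(frozen=True)
class SourceInfo:
    # A fixed set of fields, so external tooling can rely on the shape.
    url: str
    medium: SourceInfoMedium
    version_type: SourceVersionType
    version: str
    version_guess: Optional[str] = None


class ExampleGitSource:
    """Stand-in for a Source plugin; not a real BuildStream class."""

    def __init__(self, url: str, ref: str,
                 version_override: Optional[str] = None):
        self.url = url
        self.ref = ref
        # Hypothetical configuration key letting the project author
        # override whatever the plugin would otherwise guess.
        self.version_override = version_override

    def _guess_version(self) -> Optional[str]:
        # Look for a dotted version in a `git describe` style ref.
        match = re.search(r"(\d+(?:\.\d+)+)", self.ref)
        return match.group(1) if match else None

    def collect_source_info(self) -> List[SourceInfo]:
        guess = self.version_override or self._guess_version()
        return [
            SourceInfo(
                url=self.url,
                medium=SourceInfoMedium.GIT,
                version_type=SourceVersionType.COMMIT,
                version=self.ref,
                version_guess=guess,
            )
        ]


infos = ExampleGitSource(
    "https://example.com/foo.git", "v1.4.2-0-g1a2b3c4"
).collect_source_info()
for info in infos:
    print(info.medium.value, info.url, info.version_guess)
```

The point of the sketch is only that the user-facing override lives in
the source's own configuration, while the reported data keeps the same
fields regardless of which plugin produced it.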
Thoughts and input appreciated,
Cheers,
-Tristan
On Wed, 2024-12-11 at 19:56 +0900, Tristan van Berkom wrote:
> Hi Martín,
>
> On Thu, 2024-11-21 at 07:42 -0300, Martín Abente Lahaye wrote:
> > Hello everyone,
> >
> > Currently, some of our community plugins like collect_manifest [1]
> > and tools like bst-to-lorry [2], rely on a combination of
> > assumptions based on the reported “kind” of the Source and private
> > Python APIs.
> >
> > This can be problematic as both collect_manifest and bst-to-lorry
> > query sources for their "kind" and assume that certain attributes
> > and methods will be present (e.g., source.url). In fact, this has
> > been discussed before at least once [3].
> >
> > Although this seems to work, it’s unreliable because even if the
> > “kind” string matches, there’s no guarantee that it is the expected
> > Source, as it can be a different Source with the same “kind” string.
> > Plus, even if it really is the expected Source, a future refactor
> > could break these assumptions as these aren’t public APIs.
>
> Yes, well known issue, thanks for bringing this back to light :)
>
> [...]
> > Therefore, instead of letting plugins and tools do all that
> > unreliable guessing, we could provide what these ultimately need by
> > adding new abstract methods that each Source can implement. For
> > example, something like:
> >
> > * Source.get_urls(self) -> List[str]: Which would provide the
> > caller with a list of full upstream URLs, without any guessing or
> > relying on accessing private attributes.
> > * Source.get_versions(self) -> List[str]: Similarly, for the
> > versions, but the tricky piece with this would be the need for a
> > regexp for each source, e.g., in case the version needs to be
> > extracted from the Source URL.
> > * Source.get_trackings(self) -> List[Optional[str]]: Similarly, for
> > the tracking strings. Of course this would only make sense for
> > sources that can actually be tracked.
>
> I'm very much in favor of crafting Source APIs for Sources to report
> common information about Sources in a standard API path.
>
> This has the advantage of being easy enough to implement on Source
> implementations once, and can therefore be leveraged on a multitude
> of projects with little or no effort (aside from perhaps having those
> projects *use* the new plugin versions which support these new APIs).
>
> Into some specifics:
>
> * Given that a Source may have multiple URIs, refs, and pieces of
> tracking information, I think a more natural API would be to have
> something like a SourceInfo object defined, and ask Source
> implementations to return a list of them (e.g. list_source_info())
>
> This is mostly just for a pretty API, it's easier this way to know
> what information belongs together.
>
> In 99% of cases, this will return a single entry list.
>
> * Source.get_versions() is ambiguous to me.
>
> From the BuildStream perspective, what is a "version" of a Source ?
>
> Is it necessary to know the "version" of a Source without having the
> Source data cached locally to compute it ?
>
> It seems to me that we probably want to fetch the source first, so
> that the Source implementation has the liberty to interrogate the
> data in order to guess what a "version" is, perhaps by invoking
> things like `git describe`
>
> I'm not convinced that deriving this information from the URL alone,
> or even the URL and the ref, is sufficient for the Source to make
> a qualified "guess" at the version.
>
> In any case, this will likely be a "guess" no matter how the
> plugin computes this "version".
>
>
> > Or perhaps, something that better groups these tuples, e.g.,
> > Source.get_actual_sources(self) -> List[Tuple[str, str, Optional[str]]],
> > providing URL, version and tracking strings tuples or equivalent
> > object.
>
> Ah yes, as I mentioned above, however I would prefer BuildStream
> qualified objects, which are extensible in the future, with some
> versioning strategy, rather than rigidly defining a tuple.
>
> > A key question here is whether something like the above would still
> > be too rigid or over-specified for that plugin and tool, and perhaps
> > we should be thinking of a more free-form API to query for these.
> >
> > An idea that was mentioned when discussing this topic with
> > Abderrahim was to introduce to sources something similar to the
> > elements' public data; this way we could, e.g., add that version
> > matching regexp.
>
> I think we *also* want public data for this.
>
> I.e. the Source cannot make a fully qualified "guess" at the
> "version", and there may be other attributes we want to associate
> with a "source".
>
> As such, the project author should have the authority to override the
> Source implementation's "guess" by explicitly mentioning it in the
> bst file.
>
> Further, I would consider this proposal incomplete without adding
> CLI support to interrogate this information with.
>
> For the collect_manifest problem, I hope that we can address this
> without needing to use a plugin at all, we should be able to get away
> with only:
>
> * bst source fetch --deps all <ELEMENT>
> * bst source show --deps all --format <FORMAT> <ELEMENT>
>
> Cheers,
> -Tristan