While workable, this seems to me less than ideal, because scraping
is inherently fragile: the scraper needs manual intervention whenever
some unforeseen change happens to the structure of what's being
scraped.

The formats don't change all that much, and the big benefit of scrapers is that they are executable documentation.

Before we had the current scrapers, there were various hand-written listings of SRFI support around the net. It was impossible to tell where they came from, how they had been assembled, and which parts were up to date. There was no way for a newcomer to replicate the results.

Any large-scale data aggregation effort should absolutely use scrapers, if only for documentation purposes. But it's also a good way to avoid human error.

There's also unnecessary bandwidth being wasted repeatedly downloading
tar files, and time spent uncompressing and searching through them for
what amounts to a relatively tiny bit of data.

These are non-issues. GitHub and GitLab have tons of bandwidth. Gambit is one of the biggest Schemes, and running listings/gambit-head.sh takes only 4 seconds, including the time GitHub takes to generate a tailor-made tar archive of Gambit's git master for us.

We should scrape all this from different implementations and package indexes, and aggregate it into one place where it's available as one JSON and/or S-expression file.

But it pays to make a distinction between source data and aggregated data. If one aggregator takes 5 seconds to scrape each source, it's not a problem.

Because there is no standard, the data you get from an arbitrary
Scheme's tar file is going to be unstructured, requiring more custom
rules to extract it.

Wouldn't it be so much simpler if every Scheme published the desired
data in the desired format, where it could be directly and reliably
consumed without having to write any custom code to deal with
unstructured data in random locations?

It would, and this is most easily accomplished by adding an S-expression file to each Scheme's git repo.
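For concreteness, here's a rough sketch of what such a file might contain. The file name and all the field names are made up for illustration; nothing like this has been standardized yet:

    ;; srfi-support.scm (hypothetical name) at the repo's top level.
    ;; Every field shown here is illustrative, not a settled format.
    ((implementation "chicken")
     (version "5.3.0")
     (srfi-support
      (0 builtin)            ; supported out of the box
      (1 (egg "srfi-1"))     ; supported via a package
      (2 (egg "srfi-2"))))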

GitHub and GitLab serve a raw version of every file stored in a Git repo. E.g. <https://raw.githubusercontent.com/schemedoc/implementation-metadata/master/schemes/chicken.scm>. You can also change "master" in the URL to a different branch or tag. If the aggregator could look for a standard file in each repo, it wouldn't have to download the whole repo, and fetching the file would take less than 1 second.
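Building that URL is trivial. Here's a minimal sketch in portable Scheme, assuming the hypothetical convention of a fixed metadata file name per repo:

    ;; Sketch: construct the raw-file URL for a given repo and branch.
    ;; Only string-append is used, so this is plain R7RS Scheme.
    (define (github-raw-url owner repo branch path)
      (string-append "https://raw.githubusercontent.com/"
                     owner "/" repo "/" branch "/" path))

    (github-raw-url "schemedoc" "implementation-metadata"
                    "master" "schemes/chicken.scm")
    ;; => "https://raw.githubusercontent.com/schemedoc/implementation-metadata/master/schemes/chicken.scm"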

Scanning through each package's metadata might be the most reliable way
to do this, but there is still the question of how that metadata is made
available.

The source metadata would be in each package. Each package manager would scan all of its own packages and compile an index file. The aggregator that compiles the full SRFI support table for all implementations would then aggregate _that_ data :) We should serve the full table as HTML, JSON, and S-expressions so people can save time and easily machine-extract things directly from the full aggregated data.
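As a sketch, the aggregated S-expression file could invert the per-implementation data into one entry per SRFI. All the names and support claims below are placeholders, not verified data:

    ;; Hypothetical slice of the aggregated SRFI support table.
    ;; One entry per SRFI; each entry lists implementations and
    ;; how they provide it. Placeholder data only.
    ((srfi 1
      (support (chicken (egg "srfi-1"))
               (gambit builtin)
               (guile (module (srfi srfi-1)))))
     (srfi 2
      (support (chicken (egg "srfi-2"))
               (guile (module (srfi srfi-2))))))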

There is currently no standard, to my knowledge, for how packages
are distributed or how their metadata is published. Every Scheme
does it in its own way. This is an opportunity for standardization
as well, with benefits to a metadata collection project.

All of that is correct.

I would be happy to help in the immediate future, though I'm afraid
prior commitments might tear me away in the long run.

Don't worry about commitments; we can make a repo under <https://github.com/pre-srfi> and gradually work on it.
