While workable, this seems to me less than ideal, because scraping
is inherently fragile: the scraper needs manual intervention whenever
some unforeseen change happens to the structure of what's being
scraped.

The formats don't change all that much, and the big benefit of scrapers is that they are executable documentation.

Before we had the current scrapers, there were various hand-written listings of SRFI support around the net. It was impossible to tell where they came from, how they had been assembled, and which parts were up to date. There was no way for a newcomer to replicate the results.

Any large-scale data aggregation effort should absolutely use scrapers, if only for documentation purposes. But it's also a good way to avoid human error.

There's also unnecessary bandwidth being wasted repeatedly downloading
tar files, and time spent uncompressing and searching through them for
what amounts to a relatively tiny bit of data.

These are non-issues. GitHub and GitLab have tons of bandwidth. Gambit is one of the biggest Schemes, and running listings/gambit-head.sh takes only 4 seconds, including the time GitHub takes to generate a tailor-made tar archive of Gambit's git master for us.

We should scrape all this from different implementations and package indexes, and aggregate it into one place where it's available as one JSON and/or S-expression file.

But it pays to make a distinction between source data and aggregated data. If one aggregator takes 5 seconds to scrape each source, it's not a problem.

Because there is no standard, the data you get from an arbitrary
Scheme's tar file is going to be unstructured, requiring more custom
rules to extract it.

Wouldn't it be so much simpler if every Scheme published the desired
data in the desired format, where it could be directly and reliably
consumed without having to write any custom code to deal with
unstructured data in random locations?

It would, and this is most easily accomplished by adding an S-expression file to each Scheme's git repo.
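For concreteness, here's a rough sketch of what such a file might contain. The file name and all the field names are made up for illustration; nothing like this has been standardized yet:

    ;; srfi-support.scm (hypothetical name) at the repo's top level.
    ;; Every field shown here is illustrative, not a settled format.
    ((implementation "chicken")
     (version "5.3.0")
     (srfi-support
      (0 builtin)            ; supported out of the box
      (1 (egg "srfi-1"))     ; supported via a package
      (2 (egg "srfi-2"))))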

GitHub and GitLab serve a raw version of every file stored in a Git repo. E.g. <https://raw.githubusercontent.com/schemedoc/implementation-metadata/master/schemes/chicken.scm>. You can also change "master" in the URL to a different branch or tag. If the aggregator could look for a standard file in each repo, it wouldn't have to download the whole repo, and fetching the file would take less than 1 second.
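Building that URL is trivial. Here's a minimal sketch in portable Scheme, assuming the hypothetical convention of a fixed metadata file name per repo:

    ;; Sketch: construct the raw-file URL for a given repo and branch.
    ;; Only string-append is used, so this is plain R7RS Scheme.
    (define (github-raw-url owner repo branch path)
      (string-append "https://raw.githubusercontent.com/"
                     owner "/" repo "/" branch "/" path))

    (github-raw-url "schemedoc" "implementation-metadata"
                    "master" "schemes/chicken.scm")
    ;; => "https://raw.githubusercontent.com/schemedoc/implementation-metadata/master/schemes/chicken.scm"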

Scanning through each package's metadata might be the most reliable way
to do this, but there is still the question of how that metadata is made
available.

The source metadata would be in each package. Each package manager would scan all of its own packages and compile an index file. The aggregator that compiles the full SRFI support table for all implementations would then aggregate _that_ data :) We should serve the full table as HTML, JSON, and S-expressions so people can save time and easily machine-extract things directly from the full aggregated data.
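As a sketch, the aggregated S-expression file could invert the per-implementation data into one entry per SRFI. All the names and support claims below are placeholders, not verified data:

    ;; Hypothetical slice of the aggregated SRFI support table.
    ;; One entry per SRFI; each entry lists implementations and
    ;; how they provide it. Placeholder data only.
    ((srfi 1
      (support (chicken (egg "srfi-1"))
               (gambit builtin)
               (guile (module (srfi srfi-1)))))
     (srfi 2
      (support (chicken (egg "srfi-2"))
               (guile (module (srfi srfi-2))))))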

There is currently no standard, to my knowledge, for how packages
are distributed or how their metadata is published. Every Scheme
does it in its own way. This is an opportunity for standardization
as well, with benefits to a metadata collection project.

All of that is correct.

I would be happy to help in the immediate future, though I'm afraid
prior commitments might tear me away in the long run.

Don't worry about commitments; we can make a repo under <https://github.com/pre-srfi> and gradually work on it.
