Re: [aur-general] AUR package metadata dump

2018-11-16 Thread Eli Schwartz via aur-general
On 11/16/18 2:11 PM, Thore Bödecker via aur-general wrote:
> Anywho, I just wanted to put this out there and gather some thoughts, feedback
> and opinions on this.

At this point, this discussion is no longer "hey, do you know what
happened and why this doesn't work?"

Now that we're beginning to propose high-level code concepts and make
highly specific requests, I think we're ready to move the discussion to
the aur-dev mailing list. :)

Thanks.

-- 
Eli Schwartz
Bug Wrangler and Trusted User





Re: [aur-general] AUR package metadata dump

2018-11-16 Thread Thore Bödecker via aur-general
On 16.11.18 - 18:27, Florian Pritz via aur-general wrote:
> My idea is to either generate the results on demand or cache them in the
> code. If cached in the code, there would be no database load. It would
> just pass through the code so we can perform rate limiting. Granted, if
> we can implement the rate limit in nginx (see below), that would be
> essentially the same and fine too. Then we/you could indeed just dump it
> to a file and serve that.

I've just been discussing an idea with Florian that might provide a
reasonable solution for both sides:

Clients could send a timestamp to the API, meaning "give me all
updates since *that*".
The update timestamps for the packages are already tracked in the
database anyway; putting an index on that column would make requesting
various ranges quite efficient.

If there were no changes since the client-supplied timestamp, the API
could respond with HTTP 304 "Not Modified" (possibly even without a
body), which would convey a suitable meaning in a very small response.
We could also think about not logging those 304 responses; I don't know
what the general opinion on that is.

If a new client wants to get started and build its own archive, it
could supply the timestamp "since=0" (we're talking about unix
timestamps here), which would simply result in a response containing
all packages.
To prevent abuse of such (very large) deltas, we could implement some
sort of shared rate limit, as Florian mentioned.
A first idea would be a shared rate limit for all requests with a
timestamp older than 48 hours, for example. We could allow something
like 200 such requests per hour; if that limit were exceeded, the API
would reply with, say, HTTP 400 "Bad Request" or HTTP 412 "Precondition
Failed", along with a "Retry-After: [0-9]+" header telling the client
when to try again.
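
To make this a bit more concrete, here is a rough sketch of what such
an endpoint could look like (purely illustrative Python/Flask, not
aurweb code; the column name, limits and query helper are made up):

    import time

    from flask import Flask, jsonify, request

    app = Flask(__name__)

    LARGE_DELTA = 48 * 3600   # deltas older than this share one budget
    SHARED_LIMIT = 200        # allowed "large delta" requests per hour
    _shared = {"start": 0.0, "count": 0}


    def packages_modified_since(since):
        # Placeholder for "SELECT ... WHERE ModifiedTS > :since" on an
        # indexed timestamp column.
        return []


    @app.route("/rpc/updates")
    def updates():
        since = request.args.get("since", type=int, default=0)
        now = int(time.time())

        # Very large deltas draw from a shared hourly budget.
        if now - since > LARGE_DELTA:
            if now - _shared["start"] > 3600:
                _shared["start"], _shared["count"] = now, 0
            if _shared["count"] >= SHARED_LIMIT:
                retry = int(_shared["start"] + 3600 - now)
                return "", 412, {"Retry-After": str(max(retry, 0))}
            _shared["count"] += 1

        results = packages_modified_since(since)
        if not results:
            return "", 304   # nothing changed: tiny, body-less reply
        return jsonify({"results": results})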


Anywho, I just wanted to put this out there and gather some thoughts, feedback
and opinions on this.


Cheers,
Thore

-- 
Thore Bödecker

GPG ID: 0xD622431AF8DB80F3
GPG FP: 0F96 559D 3556 24FC 2226  A864 D622 431A F8DB 80F3




Re: [aur-general] AUR package metadata dump

2018-11-16 Thread Florian Pritz via aur-general
On Fri, Nov 16, 2018 at 05:35:28PM +0300, Dmitry Marakasov  
wrote:
> - Much less load on the server.
> 
>   I've looked through API code and it does an extra SQL query per a
>   package to get extended data such as dependencies and licenses, which
>   consists of multiple unions and joins involving 10 tables. That looks
>   extremely heavy, and getting a dump through API is equivalent to
>   issuing this heavy query 53k times (e.g. for each package).

Actually, the database load of the current API is so low (possibly due
to the MySQL query cache) that we failed to measure a difference when
we put it behind a 10-minute cache via nginx. The most noticeable
effect of API requests is the size of the log file. Beyond that, I am
on principle against runaway scripts that generate unnecessary
requests. The log file that filled up the disk was the primary trigger
for looking into this, though.

My idea is to either generate the results on demand or cache them in
the code. If cached in the code, there would be no database load. It
would just pass through the code so we can perform rate limiting.
Granted, if we can implement the rate limit in nginx (see below), that
would be essentially the same and fine too. Then we/you could indeed
just dump it to a file and serve that.
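
To illustrate what I mean by caching in the code (a minimal sketch in
Python, not actual aurweb code; the TTL and names are made up): the
handler regenerates the dump at most every few minutes and otherwise
serves it from memory, so the request still passes through the
application where rate limiting can apply, but adds no database load:

    import time

    CACHE_TTL = 600   # regenerate at most every 10 minutes
    _cache = {"body": None, "generated": 0.0}


    def get_dump(generate):
        """Return the cached dump body, rebuilding it via generate()
        when it is older than CACHE_TTL."""
        now = time.time()
        if _cache["body"] is None or now - _cache["generated"] > CACHE_TTL:
            _cache["body"] = generate()
            _cache["generated"] = now
        return _cache["body"]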

> I don't think that the existence of the dump will encourage clients
> which don't need ALL the data as it's still heavier to download and
> decompress - doing that every X seconds will create noticeable load
> on the clients asking to redo it in a proper way.

You'll be amazed what ideas people come up with and what they don't
notice. Someone once thought that it would be a good idea to have a
script that regularly (I think daily) fetches a sorted mirror list from
our web site and then reuses that without modification. Obviously, if
many people use that solution and all use the same sort order (as
intended by the script author), they all end up with the same mirror in
the first line, and that mirror becomes overloaded quite quickly.

> It can also still be rate limited (separately from API, and probably with
> much lower rate, e.g. 4 RPH looks reasonable) - I see aur.archlinux.org
> uses nginx, and it supports such rate limiting pretty well.

How would you configure a limit of 4/hour? Last time I checked, nginx
only supported limits per second and per minute, with no arbitrary time
frames or non-integer values. This still seems to be the case after a
quick check of the documentation[1]. Thus, the lowest limit I could
configure is 1r/m, but that's something totally different from what
I/we want. If you have a solution for configuring arbitrary limits
directly in nginx, I'd love to know about it.

[1]
http://nginx.org/en/docs/http/ngx_http_limit_req_module.html#limit_req_zone
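
For comparison, a 4/hour limit is easy to express in application code.
A minimal fixed-window sketch in Python (illustrative only, not
something we currently run):

    import time
    from collections import defaultdict

    WINDOW = 3600   # one hour
    LIMIT = 4       # requests per window per client

    _hits = defaultdict(list)   # client ip -> recent request timestamps


    def allow(client_ip):
        """Return True if the client stays within LIMIT per WINDOW."""
        now = time.time()
        recent = [t for t in _hits[client_ip] if now - t < WINDOW]
        if len(recent) >= LIMIT:
            _hits[client_ip] = recent
            return False
        recent.append(now)
        _hits[client_ip] = recent
        return True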

Florian




Re: [aur-general] AUR package metadata dump

2018-11-16 Thread Dmitry Marakasov
* brent s. (b...@square-r00t.net) wrote:

> (SNIP)
> >> While fetching data from API, Repology does a 1 second pause between
> >> requests to not create excess load on the server, but there are still
> >> frequent 429 errors. I've tried 2 second delays, but the 429s are still
> >> there, and fetch time increases dramatically as we have to do more than
> >> 500 requests. Probably API is loaded by other clients as well.
> > 
> > Our rate limit is 4000 per 24 hours. One-second pauses aren't taken into
> > account, and our initial motivation to add rate limiting was to ban
> > users who were using 5-second delays...
> > 
> (SNIP)
> 
> 
> don't forget about the URI max length, too. staggering into requests of
> 100 pkgs would work fine, but worth noting the max length is 4443 bytes
> 
> https://wiki.archlinux.org/index.php/Aurweb_RPC_interface#Limitations

Actually, it does work fine with URL lengths of up to 8k.

-- 
Dmitry Marakasov   .   55B5 0596 FF1E 8D84 5F56  9510 D35A 80DD F9D2 F77D
amd...@amdmi3.ru  ..:  https://github.com/AMDmi3


Re: [aur-general] AUR package metadata dump

2018-11-16 Thread Dmitry Marakasov
* Florian Pritz via aur-general (aur-general@archlinux.org) wrote:

> > The way Repology currently fetches AUR package data is as follows:
> > - fetch https://aur.archlinux.org/packages.gz
> > - split packages into 100 item packs
> > - fetch JSON data for packages in each pack from 
> > https://aur.archlinux.org/rpc/?v=5&type=info&arg[]=
> > 
> > While fetching data from API, Repology does a 1 second pause between
> > requests to not create excess load on the server, but there are still
> > frequent 429 errors. I've tried 2 second delays, but the 429s are still
> > there, and fetch time increases dramatically as we have to do more than
> > 500 requests. Probably API is loaded by other clients as well.
> 
> The rate limit allows 4000 API requests per source IP in a 24 hour
> window. It does not matter which type of request you send or how many
> packages you request information for. Spreading out requests is still
> appreciated, but it mostly won't influence your rate limit.
> 
> The packages.gz file currently contains around 53000 packages. If you
> split those into packs of 100 each and then perform a single API request
> for each pack to fetch all the details, you end up with roughly 530
> requests. Given you hit the limit, you probably check multiple times
> each day, correct? I'd suggest to spread the checks over a 6 hour period
> or longer. This should keep you well below the limit.

Thanks for the clarification! Correct, I'm doing multiple updates a
day; the rate varies, but it's roughly once every 2 hours. I guess I
can stuff more packages into a single request for now. Later, proper
update scheduling will be implemented (which will allow, e.g., setting
AUR to update no more often than every 3 hours), but I hope to
facilitate a JSON dump, which would allow both faster and more frequent
updates.
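
"Stuffing more packages into a single request" would look roughly like
this on my side (an illustrative sketch in Python; the 4443-byte limit
is the one documented on the wiki):

    from urllib.parse import quote

    BASE = "https://aur.archlinux.org/rpc/?v=5&type=info"
    MAX_URI = 4443   # documented request length limit


    def chunk_requests(package_names):
        """Yield info-request URLs, each carrying as many arg[]
        entries as fit under the URI length limit."""
        url = BASE
        for name in package_names:
            piece = "&arg[]=" + quote(name)
            if url != BASE and len(url) + len(piece) > MAX_URI:
                yield url
                url = BASE
            url += piece
        if url != BASE:
            yield url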

> > I suggest to implement a regularly updated JSON dump of information
> > on all packages and make it available for the site, like packages.gz is.
> > The content should be similar to what 
> > https://aur.archlinux.org/rpc/?v=5&type=info
> > would return for all packages at once.
> >
> > This will eliminate the need to access the API and generate load
> > on it, simplify and speed up fetching dramatically for both Repology
> > and possible other clients.
> 
> It may also generate much more network traffic since the problem that
> prompted the creation of the rate limit was that people ran update check
> scripts every 5 or 10 seconds via conky. Some of those resulted in up to
> 40 million requests in a single day due to inefficient clients and a
> huge number of checked packages. I'm somewhat worried that a central
> dump may just invite people to write clients that fetch it and then we
> start this whole thing again. Granted, it's only a single request per
> check, but the response is likely quite big. Maybe the best way to do
> this is to actually implement it as an API call and thus share the rate
> limit with the rest of the API to prevent abuse.

As I've already replied to Eli, implementing this as an API call is a
strange thing to suggest, as it would make it much easier to generate
more load on the server and more traffic.

The benefits of the dump, as I see them, are:

- Much less load on the server.

  I've looked through the API code and it does an extra SQL query per
  package to get extended data such as dependencies and licenses, which
  consists of multiple unions and joins involving 10 tables. That looks
  extremely heavy, and getting a dump through the API is equivalent to
  issuing this heavy query 53k times (i.e. once for each package).

  A dump, on the other hand, could be generated hourly, which would
  eliminate the need for clients to resort to these heavy queries.

- Less traffic usage, as the static dump can be

  - Compressed
  - Cached
  - Not transferred at all if it hasn't changed since the previous
    request, e.g. based on If-Modified-Since or a related header (see
    the sketch below)
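
A client-side conditional fetch of such a dump could be as simple as
this (an illustrative sketch in Python using the requests library; the
dump URL is hypothetical):

    import requests

    DUMP_URL = "https://aur.archlinux.org/packages-meta.json.gz"  # hypothetical


    def fetch_dump(last_modified=None):
        """Fetch the dump only if it changed since last_modified."""
        headers = {}
        if last_modified:
            headers["If-Modified-Since"] = last_modified
        resp = requests.get(DUMP_URL, headers=headers, timeout=60)
        if resp.status_code == 304:
            return None, last_modified   # unchanged, nothing transferred
        resp.raise_for_status()
        return resp.content, resp.headers.get("Last-Modified")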

I don't think that the existence of the dump will encourage clients
which don't need ALL the data, as it's still heavier to download and
decompress; doing that every X seconds would create noticeable load on
the clients themselves, pushing them to redo it in a proper way.

It can also still be rate limited (separately from the API, and
probably at a much lower rate; e.g. 4 RPH looks reasonable). I see
aur.archlinux.org uses nginx, which supports such rate limiting pretty
well.

> Apart from all that, I'd suggest that you propose the idea (or a patch)
> on the aur-dev mailing list, assuming that there isn't a huge discussion
> about it here first.

-- 
Dmitry Marakasov   .   55B5 0596 FF1E 8D84 5F56  9510 D35A 80DD F9D2 F77D
amd...@amdmi3.ru  ..:  https://github.com/AMDmi3


Re: [aur-general] AUR package metadata dump

2018-11-16 Thread Dmitry Marakasov
* Eli Schwartz via aur-general (aur-general@archlinux.org) wrote:

> > I'm the maintainer of Repology.org, a service which monitors,
> > aggregates and compares package versions across 200+ package
> > repositories, with the purpose of simplifying package maintainers'
> > work by discovering new versions faster, improving collaboration
> > between maintainers and giving software authors a complete overview
> > of how well their projects are packaged.
> > 
> > Repology does obviously support AUR, however there were some problems
> > with retrieving information on AUR packages and I think this could
> > be improved.
> > 
> > The way Repology currently fetches AUR package data is as follows:
> > - fetch https://aur.archlinux.org/packages.gz
> > - split packages into 100 item packs
> > - fetch JSON data for packages in each pack from 
> > https://aur.archlinux.org/rpc/?v=5&type=info&arg[]=
> > 
> > While fetching data from API, Repology does a 1 second pause between
> > requests to not create excess load on the server, but there are still
> > frequent 429 errors. I've tried 2 second delays, but the 429s are still
> > there, and fetch time increases dramatically as we have to do more than
> > 500 requests. Probably API is loaded by other clients as well.
> 
> Our rate limit is 4000 per 24 hours. One-second pauses aren't taken into
> account, and our initial motivation to add rate limiting was to ban
> users who were using 5-second delays...
> 
> Please read our documentation on the limits here:
> https://wiki.archlinux.org/index.php/Aurweb_RPC_interface#Limitations

Got it, thanks for the clarification.

> A single request should be able to return as many packages as needed as
> long as it conforms to the limitations imposed by the URI length.

There's also a 5000 max_rpc_results limit.

But requesting more packages per request should fix my problem for
now. Later I'll implement finer update frequency control too, so that
e.g. AUR is updated no more frequently than every 3 hours or so.

> > I suggest to implement a regularly updated JSON dump of information
> > on all packages and make it available for the site, like packages.gz is.
> > The content should be similar to what 
> > https://aur.archlinux.org/rpc/?v=5&type=info
> > would return for all packages at once.
> 
> If the RPC interface had a parameter to circumvent the
> arg[]=pkg1&arg[]=pkg2 search, and simply request all packages, that
> would already do what you want, I guess.

That's a strange thing to suggest.

Obviously there was a reason for the API rate limiting, probably excess
CPU load or traffic usage, and allowing all packages to be fetched from
the API would make creating these kinds of load even easier, without
hitting the rate limit. It would also require more memory, as the
server would have to accumulate all the data before sending it to the
client.

> > This will eliminate the need to access the API and generate load
> > on it, simplify and speed up fetching dramatically for both Repology
> > and possible other clients.
> > 
> > Additionally, I'd like to suggest to add information on distfiles to the
> > dump (and probably an API as well for consistency). For instance,
> > Repology checks availability for all (homepage and download) links
> > it retrieves from package repositories and reports broken ones so
> > the packages could be fixed.
> 
> The source code running the website is here:
> https://git.archlinux.org/aurweb.git/about/
> 
> We currently provide the url, but not the sources for download, since
> the use case for our community has not (yet?) proposed that the latter
> is something needed. I'm unsure who would use it other than repology.
> 
> If you would like to submit a patch to implement the API that would help
> you, feel free (I'm open to discussion on merging it). However, I don't
> know if any current aurweb contributors are interested in doing the
> work. I know I'm not.

How about this?

https://github.com/AMDmi3/aurweb/compare/expose-package-sources

Not tested, though, as I'd have to install an Arch VM for proper
testing, and that can take time.

-- 
Dmitry Marakasov   .   55B5 0596 FF1E 8D84 5F56  9510 D35A 80DD F9D2 F77D
amd...@amdmi3.ru  ..:  https://github.com/AMDmi3


Re: [aur-general] AUR package metadata dump

2018-11-15 Thread Jiachen YANG via aur-general

On 2018/11/16 7:31, Uwe Koloska wrote:
> Hi Eli,
>
> Am 15.11.18 um 20:26 schrieb Eli Schwartz via aur-general:
>> The source code running the website is here:
>> https://git.archlinux.org/aurweb.git/about/
>>
>> We currently provide the url, but not the sources for download, since
>> the use case for our community has not (yet?) proposed that the latter
>> is something needed. I'm unsure who would use it other than repology.
> I don't understand what "url" and "sources" refer to. Obviously its not
> the Sourcecode of aurweb, because that's available in the linked git
> repo, isn't it?
>
> If both refer to something inside the quote, then the reference is very
> far from its destination ...
>
> Uwe


Hi Uwe,


First of all, thank you for Repology, an interesting and useful project.

I think the "url" and "sources" are refering to the 2 variables in
PKGBUILD. "url" is the url to the upstream homepage, and "sources" are
urls to download the source code, or in the case of VCS packages, the
urls to fetch VCS repositories. I think "sources" is closest thing to
"distfiles" you asked in your first message. Please see the manpage of
PKGBUILD for details [1]. These are defined in PKGBUILD and generated in
.SRCINFO for the AUR packages. And currently we only have "URL" field
exposed in the aur rpc api.

[1]:
https://www.archlinux.org/pacman/PKGBUILD.5.html#_options_and_directives



farseerfc




Re: [aur-general] AUR package metadata dump

2018-11-15 Thread Uwe Koloska
Hi Eli,

Am 15.11.18 um 20:26 schrieb Eli Schwartz via aur-general:
> The source code running the website is here:
> https://git.archlinux.org/aurweb.git/about/
> 
> We currently provide the url, but not the sources for download, since
> the use case for our community has not (yet?) proposed that the latter
> is something needed. I'm unsure who would use it other than repology.

I don't understand what "url" and "sources" refer to. Obviously it's
not the source code of aurweb, because that's available in the linked
git repo, isn't it?

If both refer to something inside the quote, then the reference is very
far from its destination ...

Uwe


Re: [aur-general] AUR package metadata dump

2018-11-15 Thread brent s.
On 11/15/18 2:58 PM, Eli Schwartz via aur-general wrote:
> 
> It's a pity that I forgot to reply with the exact same link and almost
> the exact same caveat in the very next paragraph, isn't it?
> 
> The paragraph which you quoted as "(SNIP)".
> 

it most likely would have been more noticeable if you trimmed the quoted
content down to the relevant bits instead of including it whole.


-- 
brent saner
https://square-r00t.net/
GPG info: https://square-r00t.net/gpg-info





Re: [aur-general] AUR package metadata dump

2018-11-15 Thread Eli Schwartz via aur-general
On 11/15/18 2:50 PM, brent s. wrote:
> On 11/15/18 14:26, Eli Schwartz via aur-general wrote:
>> On 11/15/18 1:25 PM, Dmitry Marakasov wrote:
>>> Hi!
>>>
> (SNIP)
>>> While fetching data from API, Repology does a 1 second pause between
>>> requests to not create excess load on the server, but there are still
>>> frequent 429 errors. I've tried 2 second delays, but the 429s are still
>>> there, and fetch time increases dramatically as we have to do more than
>>> 500 requests. Probably API is loaded by other clients as well.
>>
>> Our rate limit is 4000 per 24 hours. One-second pauses aren't taken into
>> account, and our initial motivation to add rate limiting was to ban
>> users who were using 5-second delays...
>>
> (SNIP)
> 
> 
> don't forget about the URI max length, too. staggering into requests of
> 100 pkgs would work fine, but worth noting the max length is 4443 bytes
> 
> https://wiki.archlinux.org/index.php/Aurweb_RPC_interface#Limitations

It's a pity that I forgot to reply with the exact same link and almost
the exact same caveat in the very next paragraph, isn't it?

The paragraph which you quoted as "(SNIP)".

-- 
Eli Schwartz
Bug Wrangler and Trusted User





Re: [aur-general] AUR package metadata dump

2018-11-15 Thread brent s.
On 11/15/18 14:26, Eli Schwartz via aur-general wrote:
> On 11/15/18 1:25 PM, Dmitry Marakasov wrote:
>> Hi!
>>
(SNIP)
>> While fetching data from API, Repology does a 1 second pause between
>> requests to not create excess load on the server, but there are still
>> frequent 429 errors. I've tried 2 second delays, but the 429s are still
>> there, and fetch time increases dramatically as we have to do more than
>> 500 requests. Probably API is loaded by other clients as well.
> 
> Our rate limit is 4000 per 24 hours. One-second pauses aren't taken into
> account, and our initial motivation to add rate limiting was to ban
> users who were using 5-second delays...
> 
(SNIP)


don't forget about the URI max length, too. staggering into requests of
100 pkgs would work fine, but worth noting the max length is 4443 bytes

https://wiki.archlinux.org/index.php/Aurweb_RPC_interface#Limitations

-- 
brent saner
https://square-r00t.net/
GPG info: https://square-r00t.net/gpg-info





Re: [aur-general] AUR package metadata dump

2018-11-15 Thread Eli Schwartz via aur-general
On 11/15/18 1:25 PM, Dmitry Marakasov wrote:
> Hi!
> 
> I'm the maintainer of Repology.org, a service which monitors,
> aggregates and compares package versions across 200+ package
> repositories, with the purpose of simplifying package maintainers'
> work by discovering new versions faster, improving collaboration
> between maintainers and giving software authors a complete overview
> of how well their projects are packaged.
> 
> Repology does obviously support AUR, however there were some problems
> with retrieving information on AUR packages and I think this could
> be improved.
> 
> The way Repology currently fetches AUR package data is as follows:
> - fetch https://aur.archlinux.org/packages.gz
> - split packages into 100 item packs
> - fetch JSON data for packages in each pack from 
> https://aur.archlinux.org/rpc/?v=5&type=info&arg[]=
> 
> While fetching data from API, Repology does a 1 second pause between
> requests to not create excess load on the server, but there are still
> frequent 429 errors. I've tried 2 second delays, but the 429s are still
> there, and fetch time increases dramatically as we have to do more than
> 500 requests. Probably API is loaded by other clients as well.

Our rate limit is 4000 per 24 hours. One-second pauses aren't taken into
account, and our initial motivation to add rate limiting was to ban
users who were using 5-second delays...

Please read our documentation on the limits here:
https://wiki.archlinux.org/index.php/Aurweb_RPC_interface#Limitations

A single request should be able to return as many packages as needed as
long as it conforms to the limitations imposed by the URI length.

> I suggest to implement a regularly updated JSON dump of information
> on all packages and make it available for the site, like packages.gz is.
> The content should be similar to what 
> https://aur.archlinux.org/rpc/?v=5&type=info
> would return for all packages at once.

If the RPC interface had a parameter to circumvent the
arg[]=pkg1&arg[]=pkg2 search, and simply request all packages, that
would already do what you want, I guess.

> This will eliminate the need to access the API and generate load
> on it, simplify and speed up fetching dramatically for both Repology
> and possible other clients.
> 
> Additionally, I'd like to suggest to add information on distfiles to the
> dump (and probably an API as well for consistency). For instance,
> Repology checks availability for all (homepage and download) links
> it retrieves from package repositories and reports broken ones so
> the packages could be fixed.

The source code running the website is here:
https://git.archlinux.org/aurweb.git/about/

We currently provide the url, but not the sources for download, since
the use case for our community has not (yet?) proposed that the latter
is something needed. I'm unsure who would use it other than repology.

If you would like to submit a patch to implement the API that would help
you, feel free (I'm open to discussion on merging it). However, I don't
know if any current aurweb contributors are interested in doing the
work. I know I'm not.

-- 
Eli Schwartz
Bug Wrangler and Trusted User





Re: [aur-general] AUR package metadata dump

2018-11-15 Thread Florian Pritz via aur-general
On Thu, Nov 15, 2018 at 09:25:02PM +0300, Dmitry Marakasov  
wrote:
> The way Repology currently fetches AUR package data is as follows:
> - fetch https://aur.archlinux.org/packages.gz
> - split packages into 100 item packs
> - fetch JSON data for packages in each pack from 
> https://aur.archlinux.org/rpc/?v=5&type=info&arg[]=
> 
> While fetching data from API, Repology does a 1 second pause between
> requests to not create excess load on the server, but there are still
> frequent 429 errors. I've tried 2 second delays, but the 429s are still
> there, and fetch time increases dramatically as we have to do more than
> 500 requests. Probably API is loaded by other clients as well.

The rate limit allows 4000 API requests per source IP in a 24 hour
window. It does not matter which type of request you send or how many
packages you request information for. Spreading out requests is still
appreciated, but it mostly won't influence your rate limit.

The packages.gz file currently contains around 53000 packages. If you
split those into packs of 100 each and then perform a single API request
for each pack to fetch all the details, you end up with roughly 530
requests. Given that you hit the limit, you probably check multiple
times each day, correct? I'd suggest spreading the checks over a 6-hour
period or longer. This should keep you well below the limit.
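
Back-of-the-envelope, with the numbers above (a quick illustrative
calculation):

    PACKAGES = 53000
    PER_REQUEST = 100
    DAILY_LIMIT = 4000

    requests_per_check = -(-PACKAGES // PER_REQUEST)          # ceiling: 530
    full_checks_per_day = DAILY_LIMIT // requests_per_check   # 7 fit the limit
    # Checking every 6 hours (4 checks/day) needs ~2120 requests,
    # roughly half of the daily limit.
    print(requests_per_check, full_checks_per_day)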

> I suggest to implement a regularly updated JSON dump of information
> on all packages and make it available for the site, like packages.gz is.
> The content should be similar to what 
> https://aur.archlinux.org/rpc/?v=5&type=info
> would return for all packages at once.
>
> This will eliminate the need to access the API and generate load
> on it, simplify and speed up fetching dramatically for both Repology
> and possible other clients.

It may also generate much more network traffic since the problem that
prompted the creation of the rate limit was that people ran update check
scripts every 5 or 10 seconds via conky. Some of those resulted in up
to 40 million requests in a single day due to inefficient clients and a
huge number of checked packages. I'm somewhat worried that a central
dump may just invite people to write clients that fetch it and then we
start this whole thing again. Granted, it's only a single request per
check, but the response is likely quite big. Maybe the best way to do
this is to actually implement it as an API call and thus share the rate
limit with the rest of the API to prevent abuse.

Apart from all that, I'd suggest that you propose the idea (or a patch)
on the aur-dev mailing list, assuming that there isn't a huge discussion
about it here first.

Florian




[aur-general] AUR package metadata dump

2018-11-15 Thread Dmitry Marakasov
Hi!

I'm the maintainer of Repology.org, a service which monitors,
aggregates and compares package versions across 200+ package
repositories, with the purpose of simplifying package maintainers' work
by discovering new versions faster, improving collaboration between
maintainers and giving software authors a complete overview of how well
their projects are packaged.

Repology obviously supports AUR; however, there have been some
problems with retrieving information on AUR packages, and I think this
could be improved.

The way Repology currently fetches AUR package data is as follows:
- fetch https://aur.archlinux.org/packages.gz
- split packages into 100 item packs
- fetch JSON data for packages in each pack from 
https://aur.archlinux.org/rpc/?v=5&type=info&arg[]=

While fetching data from API, Repology does a 1 second pause between
requests to not create excess load on the server, but there are still
frequent 429 errors. I've tried 2 second delays, but the 429s are still
there, and fetch time increases dramatically as we have to do more than
500 requests. Probably API is loaded by other clients as well.

I suggest implementing a regularly updated JSON dump of information on
all packages and making it available on the site, like packages.gz is.
The content should be similar to what
https://aur.archlinux.org/rpc/?v=5&type=info
would return for all packages at once.
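
Conceptually, a simple periodic job could produce it along these lines
(an illustrative sketch in Python; the query helper, the output
structure and the file name are all made up, not aurweb code):

    import gzip
    import json


    def all_package_info():
        # Placeholder for a single query returning the same fields the
        # /rpc type=info endpoint returns, for every package at once.
        return []


    def write_dump(path="packages-meta.json.gz"):
        payload = {"version": 5, "type": "info",
                   "results": all_package_info()}
        with gzip.open(path, "wt", encoding="utf-8") as f:
            json.dump(payload, f)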

This would eliminate the need to access the API and generate load on
it, and would simplify and dramatically speed up fetching for both
Repology and possibly other clients.

Additionally, I'd like to suggest adding information on distfiles to
the dump (and probably to the API as well, for consistency). For
instance, Repology checks the availability of all (homepage and
download) links it retrieves from package repositories and reports
broken ones so the packages can be fixed.

-- 
Dmitry Marakasov   .   55B5 0596 FF1E 8D84 5F56  9510 D35A 80DD F9D2 F77D
amd...@amdmi3.ru  ..:  https://github.com/AMDmi3