On 7/3/25 6:23 AM, Constantine A. Murenin wrote:
> Can you really blame kids for looking at all 5000 links from a single
> file, when you give them 5000 links to start with? Maybe start by not
> giving the 5000 unique links from a single file, and implement caching
> / throttling? How could you know there's nothing interesting in there
> if you don't visit it all for a few files first?
Are you intentionally misrepresenting the problem?
> These AIs literally behave the exact same way as humans; they're
> simply dumber and more persistent. The way CVSweb is designed, it's
> easily DoS'able with the default `wget -r` and `wget --recursive` from
> probably like 20 years ago?
This is complete BS. "wget -r" uses a single connection (at any point in
time). It uses a consistent source address. It actually honors
robots.txt by default. None of that applies to the current generation of
AI scrapers:
(1) They have no effective rate limiting mechanism on the origin side.
(2) They are intentionally distributing requests to avoid server-side
rate limits (see the sketch after this list).
(3) The combination of the two makes most caching useless.
(4) They (intentionally or maliciously) do not honor robots.txt.
(5) They are intentionally faking the user agent.
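To make (1)-(3) concrete, here is a minimal sketch (mine, purely for
illustration, not anything actually deployed) of the per-source-IP token
bucket an operator might put in front of CVSweb; the rates, names and
addresses are made up. A single greedy client runs out of tokens almost
immediately, while a fleet that spreads the same load over a hundred
addresses never trips the limit at all:

```
# Illustrative sketch only: a per-source-IP token bucket.
# RATE, BURST and the addresses below are made-up example values.
import time
from collections import defaultdict

RATE = 1.0        # tokens refilled per second, per source IP
BURST = 5.0       # maximum bucket size per source IP

_buckets = defaultdict(lambda: {"tokens": BURST, "stamp": time.monotonic()})

def allow(source_ip: str) -> bool:
    """Return True if a request from source_ip is within its per-IP budget."""
    b = _buckets[source_ip]
    now = time.monotonic()
    b["tokens"] = min(BURST, b["tokens"] + (now - b["stamp"]) * RATE)
    b["stamp"] = now
    if b["tokens"] >= 1.0:
        b["tokens"] -= 1.0
        return True
    return False

# A single greedy client is throttled after the initial burst...
print(sum(allow("192.0.2.1") for _ in range(100)))       # ~5 allowed

# ...but the same 100 requests spread over 100 addresses all pass,
# which is how a distributed scraper sidesteps the per-IP limit.
print(sum(allow(f"203.0.113.{i}") for i in range(100)))  # 100 allowed
```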
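And for contrast with (4): a polite recursive crawler, which is what
"wget -r" has been for decades, checks robots.txt before fetching
anything. A minimal sketch of that check, with a made-up host, a
made-up Disallow rule and a made-up user agent string:

```
# Illustrative sketch of the robots.txt check a polite crawler performs
# before fetching; "wget -r" does the equivalent by default.
from urllib import robotparser

ROBOTS_TXT = """\
User-agent: *
Disallow: /cgi-bin/cvsweb.cgi
"""

rp = robotparser.RobotFileParser()
rp.parse(ROBOTS_TXT.splitlines())   # normally fetched once per host

for url in ("https://anoncvs.example.org/index.html",
            "https://anoncvs.example.org/cgi-bin/cvsweb.cgi/src/?rev=1.1"):
    if rp.can_fetch("polite-crawler/1.0", url):
        print("would fetch:", url)
    else:
        print("skip (disallowed by robots.txt):", url)
```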
Comparing AI scrapers to regular, non-criminal human behavior is ignorant
at best and intellectually dishonest at worst.
Joerg