On Mon, Apr 01, 2024 at 01:21:45PM +0000, Eric Wong wrote:
> Performance is still slow, and crawler traffic patterns tend to
> do bad things with caches at all levels, so I've regretfully had
> to experiment with robots.txt to mitigate performance problems.

This has been a source of grief for us, because aggressive bots don't appear
to pay any attention to robots.txt, and they fudge their user-agent string to
pretend to be a regular browser. I am dealing with one that is hammering us
from China Mobile IP ranges and is currently trying to download every possible
snapshot of torvalds/linux while pretending to be various versions of Chrome.

So, while I welcome having a robots.txt recommendation, it kinda assumes that
the robots will actually play nice and won't try to suck down as much as
possible, as quickly as possible, to train some LLM-du-jour.

/end rant

-K
