On Sun, 8 Jun 2025 at 11:42, Mouse <mo...@rodents-montreal.org> wrote:
>
> >> I can't easily check -current, because HTTP access to cvsweb has
> >> been broken; it now insists on trying to ram HTTPS down my throat.
> > Side note: it is far worse than http vs. https, it uses www/anubis
> > [...JavaScript worker threads...sha256...]
>
> > Unfortunately this kind of drastic measures have become necessary to
> > protect against clearly broken AI crawlers that do not respect the
> > /robots.txt standard.
>
> Curious.  I'm not getting swamped (which would be relatively easy to
> do; I'm behind a fairly slow DSL link), even though I have, from an
> HTTP client's point of view, *many* NetBSD source trees and assorted
> other software available.
>
> Is it...because I don't support HTTPS at all?  Because of my border
> blacklist?  Because I'm using (slightly mutant) bozohttpd instead of
> something commoner?  Because I'm exporting just the software, not
> things like a UI to wandering around the tree?  Because they just
> haven't noticed me yet?  (*Some* crawlers certainly have.)  I can't
> help thinking that it might be worth trying other approaches.  For
> example, if the logs indicate the ill-behaved crawlers stick to HTTPS
> (which wouldn't surprise me), maybe do the anubis thing for HTTPS but
> not for HTTP?  Those with the cycles for HTTPS are more likely to have
> the cycles for JS and SHA256, I feel sure.
>
> Personally, I'd be inclined to block the netblocks they're coming from,
> with complaints to their abuse contacts, and those which don't respond,
> or which support the misbehaviour, stay blocked; those which clean up
> their act get unblocked.  But I don't know how well that matches up
> with NetBSD's tradeoffs.

Hello,

I'd like to second Mouse's concerns, and I'm also interested in having
plain http access restored without any need for JavaScript or any
manual interventions for page loads.

Why would we NOT want to have AI train on our source code?

Let's assume you're an AI.

You go to http://cvsweb.netbsd.org/ .

You see 5 links, one per project: htdocs, pkgsrc, src, xsrc, and
othersrc.  So far so good.

You go to https://cvsweb.netbsd.org/bsdweb.cgi/src/ , and see one link
per directory and two per file.  So far so good.

You go to https://cvsweb.netbsd.org/bsdweb.cgi/src/Makefile and see a
MINIMUM of 8 links for EACH of the 341 revisions on the MAIN branch
(we're on rev1.341 right now).  Plus one extra link per tag and per
branch.  Plus a minimum of 10 links for each non-MAIN-branch
revision, like rev1.335.2.1, plus extra branch-point links from the
MAIN-branch revision that branch forked off, and everything else
under the sun.  There are easily 5000 or more links on that single
page!
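
Here's a very rough back-of-the-envelope in Python for that one page;
the per-revision link counts and the branch-revision count are just
the estimates from this email, not measured values:

main_revs = 341            # rev 1.1 through rev 1.341 on the MAIN branch
links_per_main_rev = 8     # minimum number of links per MAIN revision
branch_revs = 200          # guess: non-MAIN revisions like rev 1.335.2.1
links_per_branch_rev = 10  # minimum links per non-MAIN revision
misc_links = 100           # guess: tags, branch points, navigation, etc.

total = (main_revs * links_per_main_rev
         + branch_revs * links_per_branch_rev
         + misc_links)
print(total)               # 4828 -- already on the order of 5000 links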

Remember, you're an AI here, on a mission to learn more about NetBSD,
and you have all the time in the world!  You simply GOTTA CHECK THEM
ALL!!!

…

Sorry, but we can't have 5000 links per file!  Even pre-LLM
`wget --recursive` can unintentionally DDoS such a website!

We already have something like 100k files in src, so if we assume
1000 links per file on average, cvsweb exposes on the order of 100M
active/linked unique URLs.  That immediate 1000x amplification factor
has to be the cause of the DDoS, since an AI crawler may well try to
check every single one of them.
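
Equally rough, the site-wide figure implied above (both the 100k file
count and the 1000-links-per-file average are assumptions, not
measurements):

files_in_src = 100_000    # roughly the number of files in src
links_per_file = 1_000    # assumed average unique links per file page
unique_urls = files_in_src * links_per_file
print(unique_urls)        # 100_000_000 -- ~100M URLs for a crawler
                          # that blindly follows every link it sees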

Worse, those extra per-file links might be computationally more
expensive to serve than the starting point (the changelog with the
commit messages), which is basically all linear and/or hashed data.

…

The solution should be to remove/address the cause, not to prohibit
all the other use-cases from continuing to exist.

Since we cannot remove these thousands of links for each file, we
have to address access to said links, but the rest of the site should
still work over plain http without any interventions.  Non-cached
diffs and other expensive features could redirect to https and
require the extra work there.
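
A minimal sketch of that split, as a hypothetical request classifier;
the query-parameter names here (r1/r2 for diffs, annotate, rev) are
my recollection of cvsweb URLs, not a checked spec:

from urllib.parse import urlsplit, parse_qs

EXPENSIVE_PARAMS = {"r1", "r2", "annotate", "rev"}  # diff/annotate/checkout

def needs_challenge(url: str) -> bool:
    """Return True if the request should be redirected to https."""
    query = parse_qs(urlsplit(url).query)
    return bool(EXPENSIVE_PARAMS & query.keys())

# Directory listing and changelog stay on plain http:
assert not needs_challenge("http://cvsweb.netbsd.org/bsdweb.cgi/src/")
assert not needs_challenge("http://cvsweb.netbsd.org/bsdweb.cgi/src/Makefile")
# A diff between two revisions would get the redirect:
assert needs_challenge(
    "http://cvsweb.netbsd.org/bsdweb.cgi/src/Makefile.diff?r1=1.340&r2=1.341")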

Running an nginx cache in front would probably be a good idea too, to
make sure it's all cached, especially for links that get pasted in a
chatroom.  The idea that a chatroom link could result in everyone
having to solve two minutes' worth of proof-of-work is entirely
bonkers, when an nginx cache would solve the whole thing with no one
waiting at all: the page would already be cached before the link is
even posted, because the poster loads it themselves before pasting
the link in the chatroom.
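
Purely as a toy illustration of that point (a real setup would be
nginx's proxy_cache or similar in front of cvsweb, not Python):

import time

cache = {}

def fetch(url):
    if url in cache:       # everyone clicking the chatroom link later
        return cache[url]  # gets the stored copy instantly
    time.sleep(2)          # stand-in for the expensive CGI work
    page = "<rendered page for %s>" % url
    cache[url] = page      # the poster's own page load fills the cache
    return page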

Best regards,
Constantine.
