All,
We are happy with Anubis. It's nothing short of a miracle. It seems to
be doing the trick. The downside is the cartoon character on the load
screen. It is cute, but on production it has a negative/confusing
effect on staff and patrons. We've put together an Ansible playbook
that installs Anubis from its source code. It makes a minor tweak to
the CSS that turns the page white. So the experience is an empty screen
for a moment while Anubis does its thing. And then it's over. No one is
the wiser.
In case anyone is interested:
https://github.com/mcoia/mobius_evergreen/tree/master/custom_anubis
PRs are welcome!
Keep in mind: this playbook does *not* mess with NGINX's site config.
You'll need to customize that to proxy traffic to Anubis if you want to
enable it.
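For the curious, a minimal sketch of what that proxying could look
like, assuming Anubis listens on 127.0.0.1:8923 (its default) and has
been configured with Evergreen/Apache as its upstream target (the
hostname below is made up):

    server {
        listen 443 ssl;
        server_name catalog.example.org;

        location / {
            # Hand everything to Anubis first; it proxies the request
            # on to Evergreen once the browser passes the check.
            proxy_pass http://127.0.0.1:8923;
            proxy_set_header Host $host;
            proxy_set_header X-Real-IP $remote_addr;
            proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
            proxy_set_header X-Forwarded-Proto $scheme;
        }
    }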
Also: it assumes a few things:
1. A user named "opensrf" exists
2. A folder /home/opensrf/repos exists.
3. Evergreen is installed with developer tools (make -f
Open-ILS/src/extras/Makefile.install <osname>-developer)
-Blake-
Conducting Magic
Will consume any data format
MOBIUS
On 7/11/2025 11:57 AM, Josh Stompro via Evergreen-dev wrote:
If you need an Aspen site to be a test site for Anubis you can put
LARL down on the list. I believe you can whitelist IPs with Anubis so
our branches and catalog stations can skip the checks.
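If it turns out to be easier to do the bypass at the proxy layer than
in Anubis's own policy file, a sketch of the idea in Nginx (addresses
and ports below are made up):

    # Trusted ranges go straight to the backend; everyone else gets
    # the Anubis check first.
    geo $catalog_upstream {
        default       http://127.0.0.1:8923;  # Anubis
        192.0.2.0/24  http://127.0.0.1:8080;  # branch/catalog-station range (example)
    }

    server {
        ...
        location / {
            proxy_pass $catalog_upstream;
        }
    }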
Josh
On Thu, Jul 10, 2025 at 11:58 AM Jason Boyer via Evergreen-dev
<[email protected]> wrote:
You can probably give up trying to look for IPs that send large
numbers of requests; what I'm seeing more and more are requests
from these jerks or their peers: https://brightdata.com/ai/agent-browser
who have "residential proxies," i.e. the browser extensions mentioned
in the story Josh posted. They send literally a single HTTP
request from an IP (usually on a US telecom provider's network, so
you can't reasonably block it) and then the next request comes in
from a different IP.
The patch in the bug Mike posted helps significantly and unless
users trade a lot of direct links to search results they shouldn't
be able to even detect it.
I'm looking into Anubis because we can put it in front of things
more easily than baking countermeasures into everything we host.
Because it's completely self-contained (i.e. it doesn't contact a
remote server unless you want to use a GeoIP / AS number blocking
service), I prefer it to Cloudflare, especially since their "good"
bot blocking isn't affordable for libraries. (I think the free
level basically just doesn't allow things that use a "real" bot UA
to connect to your system; if you want to block anything like a
residential proxy, you have to pay.)
Some thoughts on UA blocking since it's come up a little: don't
forget you can do things like block anything claiming to be Chrome <
100 on Windows or macOS, and apply a different cutoff to Linux
versions. Chrome on Windows and Macs will go so far as to tell you
"ok look, it's been too long, I'm restarting and then we'll go to
whatever page," so very old versions on those OSes are extremely
unlikely. Linux can be a concern, though, in case you have
libraries that have very old OPACs or similar. Also be sure to
block things like Windows 95 / 98 (but again, maybe some libraries
have Win 7 OPACs :( ), old versions of Firefox, and anything
claiming to be IE. Things actually that old likely can't even
complete an SSL handshake anymore after some of the root certs
have been rotated. A lot of proxies are using randomly constructed
UAs to make it harder to bulk-block them.
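A sketch of that kind of map-based blocking in Nginx, with purely
illustrative version cutoffs (tune them per the caveats above):

    map $http_user_agent $ua_too_old {
        default 0;
        # Chrome < 100 on Windows/macOS (example cutoff)
        "~(Windows|Macintosh).*Chrome/[1-9]?[0-9]\."  1;
        # Windows 95 / 98 era claims
        "~Windows 9[58]"                              1;
        # Anything claiming to be IE
        "~(MSIE |Trident/)"                           1;
        # Firefox < 50 (example cutoff)
        "~Firefox/[1-4]?[0-9]\."                      1;
    }

    server {
        ...
        if ($ua_too_old) {
            return 403;
        }
    }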
Jason
--
Jason Boyer
Senior System Administrator
Equinox Open Library Initiative
[email protected]
+1 (877) Open-ILS (673-6457)
https://equinoxOLI.org/
On Thu, Jul 10, 2025 at 12:08 PM Mike Rylander via Evergreen-dev
<[email protected]> wrote:
Some things to consider, inline below...
On Thu, Jul 10, 2025 at 11:25 AM John Merriam via Evergreen-dev
<[email protected]> wrote:
>
> Hello.
>
> This will block Chrome older than 110 (over 2 years old) in Nginx:
>
> if ($http_user_agent ~* "(Chrome/10[0-9]\.|Chrome/[0-9][0-9]\.|Chrome/[0-9]\.)") {
>     return 403;
> }
>
> which put a stop to it for now for us.
>
Please be careful. In addition to patrons with old browsers (there
are plenty out there, unfortunately), there are some black-box
kiosks out in the wild that are used for selfcheck and in-building
OPAC machines, which use an older Chrome (and are not free to upgrade).
> Changing user agents is trivial, though, so finding other
blockable patterns such as in URLs would be good. I didn't
find a good pattern to the URLs yet but I was only able to
look at that quickly. I plan on circling back around to that
at some point.
>
> I don't think blocking by IP will work against what seems to
be a distributed AI botnet. A few months ago we had our data
center partners block all non-US IPs. That worked for a few
months but even that doesn't work anymore. We see AI bot
traffic coming from US residential IP ranges. A gigantic
question I have is: how are they appearing to come from
residential IPs, and how could that be stopped?
>
> We plan to profile Evergreen looking for slow code that
could maybe be improved but that will be a big project.
>
I invite more eyes, of course, but "big project" is a bit of an
understatement. ;)
Please be careful when testing something that seems "slow" in
isolation -- making code X 10% faster will often make
seemingly-unrelated code Y 90% slower.
> We also plan to hook a WAF with machine learning into Nginx
and see what that can do. Another big project.
>
> We may also put captcha on more parts of the OPAC. We have
someone working on that.
>
Have you looked at
https://bugs.launchpad.net/evergreen/+bug/2113979?
With some refinement of the URL space where the not-a-bot
cookie is
required, this is shaping up to be a good first-order bot killer.
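(To be clear about shape rather than implementation -- this is not
the code from that branch, just a generic illustration of a cookie
gate at the proxy layer; the cookie name, challenge path, and port
are all invented:)

    # Requests to the gated URL space without the not-a-bot cookie
    # get bounced to a challenge page that sets it.
    map $cookie_eg_not_a_bot $missing_notabot {
        default 1;
        "~."    0;  # any non-empty cookie value passes
    }

    server {
        ...
        location /eg/opac/ {
            if ($missing_notabot) {
                return 302 /challenge;
            }
            proxy_pass http://127.0.0.1:8080;  # Evergreen/Apache
        }
    }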
> I can allocate more resources to the OPAC but that seems
like letting them win and they will probably eventually
exhaust that as well.
>
> Anubis is a nuclear option I would like to avoid.
>
I'm curious why you see this as a nuclear option. Granted, most AI
scrapers right now (at least, AFAICT) seem to be essentially
stateless, so it may be overkill compared to the LP bug linked
above,
but it's fairly straightforward to set up and maintain. The only
drawback right now is that you have to use just one instance,
which
could become a bottleneck in a very "wide" EG setup.
> Also don't want to turn to something like Cloudflare.
>
It's certainly not cost-effective for the Library space...
> Please do share any findings and I will as well.
>
> Thanks
>
>
> On 7/10/2025 10:53 AM, Josh Stompro via Evergreen-dev wrote:
>
> One piece of this puzzle that I would like to understand
better is how the bad actors are targeting our sites with
thousands to hundreds of thousands of unique IP endpoints each
day. And I just saw this article come out about how extensions
installed in nearly 1 million browsers turn them into
website-scraping bots.
>
>
https://arstechnica.com/security/2025/07/browser-extensions-turn-nearly-1-million-browsers-into-website-scraping-bots/
>
> Josh
>
>
> On Thu, Feb 13, 2025 at 3:49 PM Shula Link via Evergreen-dev
<[email protected]> wrote:
>>
>> It's not just Evergreen sites. I had to block all traffic
from Hong Kong to our system website after we had a greater
than 10x increase in visitors overnight. I tried doing it by
IP, but they just changed, so it ended up being easier to just
block everything.
>>
>> Shula Link (she/her)
>> Systems Services Librarian
>> Greater Clarks Hill Regional Library
>> [email protected] | [email protected]
>> 706-447-6702
>>
>>
>> On Thu, Feb 13, 2025 at 4:46 PM Blake Graham-Henderson via
Evergreen-dev <[email protected]> wrote:
>>>
>>> All,
>>>
>>> I almost replied with the arstechnica article that Josh
linked when the thread was started. But I decided not to put
it out there until I had set up a test system to see if I could
get that code working. A tarpit, I think, serves them right.
And, of course, the whole issue is destined to receive the
fate of spam and spam filters forever and ever.
>>>
>>> It was a serendipitously timed article. Its existence at
this moment in time signals to me that this isn't a "just us"
problem. It's the entire planet.
>>>
>>> -Blake-
>>> Conducting Magic
>>> Will consume any data format
>>> MOBIUS
>>>
>>> On 2/13/2025 3:10 PM, Josh Stompro via Evergreen-dev wrote:
>>>
>>> Jeff, thanks for bringing this up on the list.
>>>
>>> We are seeing a lot of requests like
>>> "GET
/eg/opac/mylist/delete?anchor=record_184821&record=184821"
from never-before-seen IPs, and they make 1-12 requests and
then stop.
>>>
>>> And they seem like they usually have a random out-of-date
Chrome version in the user agent string.
>>> Chrome/88.0.4324.192
>>> Chrome/86.0.4240.75
>>>
>>> I've been trying to slow down the bots by collecting logs
and grabbing all the obvious patterns and blocking netblocks
for non-US ranges. ipinfo.io offers a free country & ASN
database download that I've been using to look up the ranges
and countries (https://ipinfo.io/products/free-ip-database).
I would be happy to share a link to our current blocklist,
which has 10K non-US ranges.
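>>> (For the Nginx users: a sketch of loading such a list, assuming
>>> the ranges have been flattened into a file of "CIDR 1;" lines;
>>> the path is made up.)
>>>
>>> geo $bad_range {
>>>     default 0;
>>>     include /etc/nginx/blocked_ranges.conf;  # e.g. "203.0.113.0/24 1;"
>>> }
>>>
>>> server {
>>>     ...
>>>     if ($bad_range) {
>>>         return 403;
>>>     }
>>> }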
>>>
>>> I've also been reporting the non-US bot activity to
https://www.abuseipdb.com/ just to bring some visibility to
these bad bots. I noticed initially that many of the IPs that
we were getting hit from didn't seem to be listed on any
blocklists already, so I figured some reporting might help.
I'm kind of curious whether Evergreen sites are getting hit
from the same IPs, so an Evergreen-specific blocklist would be
useful. If you look up your bot IPs on abuseipdb.com you can
see if I've already reported any of them.
>>>
>>> I've also been making use of block lists from
https://iplists.firehol.org/
>>> Such as
>>> https://iplists.firehol.org/files/cleantalk_30d.ipset
>>> https://iplists.firehol.org/files/botscout_7d.ipset
>>> https://iplists.firehol.org/files/firehol_abusers_1d.netset
>>>
>>> We are using HAProxy, so I did some looking into the
CrowdSec HAProxy Bouncer
(https://docs.crowdsec.net/u/bouncers/haproxy/) but I'm not
sure that would help since these IPs don't seem to be on
blocklists. But I may just not quite understand how CrowdSec
is supposed to work.
>>>
>>> HAProxy Enterprise has a reCAPTCHA module that I think
would allow us to feed any non-US connections that haven't
connected before through a captcha, but the price for
HAProxy Enterprise is out of our budget.
https://www.haproxy.com/blog/announcing-haproxy-enterprise-3-0#new-captcha-and-saml-modules
>>>
>>> There is also a fairly up-to-date project for adding
captchas through HAProxy at
>>> https://github.com/ndbiaw/haproxy-protection. This looks
promising as a transparent method: it requires new connections
to perform a JavaScript proof-of-work calculation before
allowing access.
>>>
>>> We were taken out by ChatGPT bots back in December; their
netblocks were a bit easier to block since they were not
as spread out. I recently saw this article about how some
people are fighting back against bots that ignore robots.txt:
https://arstechnica.com/tech-policy/2025/01/ai-haters-build-tarpits-to-trap-and-trick-ai-scrapers-that-ignore-robots-txt/
>>>
>>> Josh
>>>
>>> On Mon, Jan 27, 2025 at 6:33 PM Jeff Davis via
Evergreen-dev <[email protected]> wrote:
>>>>
>>>> Hi folks,
>>>>
>>>> Our Evergreen environment has been experiencing a
higher-than-usual volume of unwanted bot traffic in recent
months. Much of this traffic looks like webcrawlers hitting
Evergreen-specific URLs from an enormous number of different
IP addresses. Judging from discussion in IRC last week, it
sounds like other EG admins have been seeing the same thing.
Does anyone have any recommendations for managing this traffic
and mitigating its impact?
>>>>
>>>> Some solutions that have been suggested/implemented so far:
>>>> - Geoblocking entire countries.
>>>> - Using Cloudflare's proxy service. There's some
trickiness in getting this to work with Evergreen.
>>>> - Putting certain OPAC pages behind a captcha.
>>>> - Deploying publicly-available blocklists of "bad bot"
IPs/useragents/etc. (good but limited, and not EG-specific).
>>>> - Teaching EG to identify and deal with bot traffic
itself (but arguably this should happen before the traffic
hits Evergreen).
>>>>
>>>> My organization is currently evaluating CrowdSec as
another possible solution. Any opinions on any of these
approaches?
>>>> --
>>>> Jeff Davis
>>>> BC Libraries Cooperative
> --
> John Merriam
> Director of Information Technology
> Bibliomation, Inc.
> 24 Wooster Ave.
> Waterbury, CT 06708
> 203-577-4070
>
_______________________________________________
Evergreen-dev mailing list -- [email protected]
To unsubscribe send an email to [email protected]