All,
We are happy with Anubis. It's nothing short of a miracle. It seems to
be doing the trick. The downside is the cartoon character on the load
screen. It is cute, but on production it has a negative/confusing
effect on staff and patrons. We've put together an Ansible playbook
that installs Anubis from its source code. It makes a minor tweak to
the CSS that turns the page white. So the experience is an empty screen
for a moment while Anubis does its thing. And then it's over. No one is
the wiser.
In case anyone is interested:
https://github.com/mcoia/mobius_evergreen/tree/master/custom_anubis
PRs are welcome!
Keep in mind: this playbook does *not* mess with NGINX's site config.
You'll need to customize that to proxy traffic to Anubis if you want to
enable it.
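For the curious, a minimal sketch of what that proxying could look
like, assuming Anubis listens on 127.0.0.1:8923 (its default) and has
been configured with Evergreen/Apache as its upstream target (the
hostname below is made up):

    server {
        listen 443 ssl;
        server_name catalog.example.org;

        location / {
            # Hand everything to Anubis first; it proxies the request
            # on to Evergreen once the browser passes the check.
            proxy_pass http://127.0.0.1:8923;
            proxy_set_header Host $host;
            proxy_set_header X-Real-IP $remote_addr;
            proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
            proxy_set_header X-Forwarded-Proto $scheme;
        }
    }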
Also: it assumes a few things:
1. A user named "opensrf" exists
2. A folder /home/opensrf/repos exists.
3. Evergreen is installed with developer tools (make -f
Open-ILS/src/extras/Makefile.install <osname>-developer)
-Blake-
Conducting Magic
Will consume any data format
MOBIUS
On 7/11/2025 11:57 AM, Josh Stompro via Evergreen-dev wrote:
If you need an Aspen site to be a test site for Anubis you can put
LARL down on the list. I believe you can whitelist IPs with Anubis so
our branches and catalog stations can skip the checks.
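If it turns out to be easier to do the bypass at the proxy layer than
in Anubis's own policy file, a sketch of the idea in Nginx (addresses
and ports below are made up):

    # Trusted ranges go straight to the backend; everyone else gets
    # the Anubis check first.
    geo $catalog_upstream {
        default       http://127.0.0.1:8923;  # Anubis
        192.0.2.0/24  http://127.0.0.1:8080;  # branch/catalog-station range (example)
    }

    server {
        ...
        location / {
            proxy_pass $catalog_upstream;
        }
    }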
Josh
On Thu, Jul 10, 2025 at 11:58 AM Jason Boyer via Evergreen-dev
<[email protected]> wrote:
You can probably give up trying to look for IPs that send large
numbers of requests; what I'm seeing more and more are requests
from these jerks or their peers: https://brightdata.com/ai/agent-browser
who have "residential proxies," i.e. the browser extensions mentioned
in the story Josh posted. They send literally a single HTTP
request from an IP (usually on a US telecom provider's network, so
you can't reasonably block it) and then the next request comes in
from a different IP.
The patch in the bug Mike posted helps significantly and unless
users trade a lot of direct links to search results they shouldn't
be able to even detect it.
I'm looking into Anubis because we can put it in front of things
more easily than baking countermeasures into everything we host.
Because it's completely self-contained (i.e. it doesn't contact a
remote server unless you want to use a GeoIP / AS number blocking
service), I prefer it to Cloudflare, especially since their "good"
bot blocking isn't affordable for libraries. (I think the free
level basically just doesn't allow things that use a "real" bot UA
to connect to your system; if you want to block anything like a
residential proxy, you have to pay.)
Some thoughts on UA blocking since it's come up a little: don't
forget you can do things like block anything claiming to be Chrome <
100 on Windows or macOS, and apply a different cutoff to Linux
versions. Chrome on Windows and Macs will go so far as to tell you
"ok look, it's been too long, I'm restarting and then we'll go to
whatever page," so very old versions on those OSes are extremely
unlikely. Linux can be a concern, though, in case you have
libraries that have very old OPACs or similar. Also be sure to
block things like Windows 95 / 98 (but again, maybe some libraries
have Win 7 OPACs :( ), old versions of Firefox, and anything
claiming to be IE. Things actually that old likely can't even
complete an SSL handshake anymore after some of the root certs
have been rotated. A lot of proxies are using randomly constructed
UAs to make it harder to bulk-block them.
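A sketch of that kind of map-based blocking in Nginx, with purely
illustrative version cutoffs (tune them per the caveats above):

    map $http_user_agent $ua_too_old {
        default 0;
        # Chrome < 100 on Windows/macOS (example cutoff)
        "~(Windows|Macintosh).*Chrome/[1-9]?[0-9]\."  1;
        # Windows 95 / 98 era claims
        "~Windows 9[58]"                              1;
        # Anything claiming to be IE
        "~(MSIE |Trident/)"                           1;
        # Firefox < 50 (example cutoff)
        "~Firefox/[1-4]?[0-9]\."                      1;
    }

    server {
        ...
        if ($ua_too_old) {
            return 403;
        }
    }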
Jason
--
Jason Boyer
Senior System Administrator
Equinox Open Library Initiative
[email protected]
+1 (877) Open-ILS (673-6457)
https://equinoxOLI.org/
On Thu, Jul 10, 2025 at 12:08 PM Mike Rylander via Evergreen-dev
<[email protected]> wrote:
Some things to consider, inline below...
On Thu, Jul 10, 2025 at 11:25 AM John Merriam via Evergreen-dev
<[email protected]> wrote:
>
> Hello.
>
> This will block Chrome older than 110 (over 2 years old) in Nginx:
>
> if ($http_user_agent ~* "(Chrome/10[0-9]\.|Chrome/[0-9][0-9]\.|Chrome/[0-9]\.)") {
>     return 403;
> }
>
> which put a stop to it for now for us.
>
Please be careful. In addition to patrons with old browsers (there
are plenty out there, unfortunately), there are some black-box
kiosks out in the wild that are used for selfcheck and in-building
OPAC machines, which use an older Chrome (and are not free to upgrade).
> Changing user agents is trivial, though, so finding other
blockable patterns such as in URLs would be good. I didn't
find a good pattern to the URLs yet but I was only able to
look at that quickly. I plan on circling back around to that
at some point.
>
> I don't think blocking by IP will work against what seems to
be a distributed AI botnet. A few months ago we had our data
center partners block all non-US IPs. That worked for a few
months but even that doesn't work anymore. We see AI bot
traffic coming from US residential IP ranges. A gigantic
question I have is: how are they appearing to come from
residential IPs, and how could that be stopped?
>
> We plan to profile Evergreen looking for slow code that
could maybe be improved but that will be a big project.
>
I invite more eyes, of course, but "big project" is a bit of an
understatement. ;)
Please be careful when testing something that seems "slow" in
isolation -- making code X 10% faster will often make
seemingly-unrelated code Y 90% slower.
> We also plan to hook a WAF with machine learning into Nginx
and see what that can do. Another big project.
>
> We may also put captcha on more parts of the OPAC. We have
someone working on that.
>
Have you looked at
https://bugs.launchpad.net/evergreen/+bug/2113979?
With some refinement of the URL space where the not-a-bot
cookie is
required, this is shaping up to be a good first-order bot killer.
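(To be clear about shape rather than implementation -- this is not
the code from that branch, just a generic illustration of a cookie
gate at the proxy layer; the cookie name, challenge path, and port
are all invented:)

    # Requests to the gated URL space without the not-a-bot cookie
    # get bounced to a challenge page that sets it.
    map $cookie_eg_not_a_bot $missing_notabot {
        default 1;
        "~."    0;  # any non-empty cookie value passes
    }

    server {
        ...
        location /eg/opac/ {
            if ($missing_notabot) {
                return 302 /challenge;
            }
            proxy_pass http://127.0.0.1:8080;  # Evergreen/Apache
        }
    }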
> I can allocate more resources to the OPAC but that seems
like letting them win and they will probably eventually
exhaust that as well.
>
> Anubis is a nuclear option I would like to avoid.
>
I'm curious why you see this as a nuclear option. Granted, most AI
scrapers right now (at least, AFAICT) seem to be essentially
stateless, so it may be overkill compared to the LP bug linked
above,
but it's fairly straightforward to set up and maintain. The only
drawback right now is that you have to use just one instance,
which
could become a bottleneck in a very "wide" EG setup.
> Also don't want to turn to something like Cloudflare.
>
It's certainly not cost-effective for the Library space...
> Please do share any findings and I will as well.
>
> Thanks
>
>
> On 7/10/2025 10:53 AM, Josh Stompro via Evergreen-dev wrote:
>
> One piece of this puzzle that I would like to understand
better is how the bad actors are targeting our sites with
thousands to hundreds of thousands of unique IP endpoints each
day. And I just saw this article come out about how extensions
installed in nearly 1 million browsers turn them into
website-scraping bots.
>
>
https://arstechnica.com/security/2025/07/browser-extensions-turn-nearly-1-million-browsers-into-website-scraping-bots/
>
> Josh
>
>
> On Thu, Feb 13, 2025 at 3:49 PM Shula Link via Evergreen-dev
<[email protected]> wrote:
>>
>> It's not just Evergreen sites. I had to block all traffic
from Hong Kong to our system website after we had a greater
than 10x increase in visitors overnight. I tried doing it by
IP, but they just changed, so it ended up being easier to just
block everything.
>>
>> Shula Link (she/her)
>> Systems Services Librarian
>> Greater Clarks Hill Regional Library
>> [email protected] | [email protected]
>> 706-447-6702
>>
>>
>> On Thu, Feb 13, 2025 at 4:46 PM Blake Graham-Henderson via
Evergreen-dev <[email protected]> wrote:
>>>
>>> All,
>>>
>>> I almost replied with the arstechnica article that Josh
linked when the thread was started. But I decided not to put
it out there until I had set up a test system to see if I could
get that code working. A tarpit, I think, serves them right.
And, of course, the whole issue is destined to receive the
fate of spam and spam filters forever and ever.
>>>
>>> It was a serendipitously timed article. Its existence at
this moment in time signals to me that this isn't a "just us"
problem. It's the entire planet.
>>>
>>> -Blake-
>>> Conducting Magic
>>> Will consume any data format
>>> MOBIUS
>>>
>>> On 2/13/2025 3:10 PM, Josh Stompro via Evergreen-dev wrote:
>>>
>>> Jeff, thanks for bringing this up on the list.
>>>
>>> We are seeing a lot of requests like
>>> "GET
/eg/opac/mylist/delete?anchor=record_184821&record=184821"
from never-before-seen IPs, and they make 1-12 requests and
then stop.
>>>
>>> And they seem like they usually have a random out-of-date
Chrome version in the user agent string.
>>> Chrome/88.0.4324.192
>>> Chrome/86.0.4240.75
>>>
>>> I've been trying to slow down the bots by collecting logs
and grabbing all the obvious patterns and blocking netblocks
for non-US ranges. ipinfo.io offers a free country & ASN
database download that I've been using to look up the ranges
and countries (https://ipinfo.io/products/free-ip-database).
I would be happy to share a link to our current blocklist,
which has 10K non-US ranges.
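>>> (For the Nginx users: a sketch of loading such a list, assuming
>>> the ranges have been flattened into a file of "CIDR 1;" lines;
>>> the path is made up.)
>>>
>>> geo $bad_range {
>>>     default 0;
>>>     include /etc/nginx/blocked_ranges.conf;  # e.g. "203.0.113.0/24 1;"
>>> }
>>>
>>> server {
>>>     ...
>>>     if ($bad_range) {
>>>         return 403;
>>>     }
>>> }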
>>>
>>> I've also been reporting the non-US bot activity to
https://www.abuseipdb.com/ just to bring some visibility to
these bad bots. I noticed initially that many of the IPs that
we were getting hit from didn't seem to be listed on any
blocklists already, so I figured some reporting might help.
I'm kind of curious whether Evergreen sites are getting hit
from the same IPs, so an Evergreen-specific blocklist would be
useful. If you look up your bot IPs on abuseipdb.com you can
see if I've already reported any of them.
>>>
>>> I've also been making use of block lists from
https://iplists.firehol.org/
>>> Such as
>>> https://iplists.firehol.org/files/cleantalk_30d.ipset
>>> https://iplists.firehol.org/files/botscout_7d.ipset
>>> https://iplists.firehol.org/files/firehol_abusers_1d.netset
>>>
>>> We are using HAProxy, so I did some looking into the
CrowdSec HAProxy Bouncer
(https://docs.crowdsec.net/u/bouncers/haproxy/) but I'm not
sure that would help since these IPs don't seem to be on
blocklists. But I may just not quite understand how CrowdSec
is supposed to work.
>>>
>>> HAProxy Enterprise has a reCAPTCHA module that I think
would allow us to feed any non-US connections that haven't
connected before through a captcha, but the price for
HAProxy Enterprise is out of our budget.
https://www.haproxy.com/blog/announcing-haproxy-enterprise-3-0#new-captcha-and-saml-modules
>>>
>>> There is also a fairly up-to-date project for adding
captchas through HAProxy at
>>> https://github.com/ndbiaw/haproxy-protection. This looks
promising as a transparent method: it requires new connections
to perform a JavaScript proof-of-work calculation before
allowing access.
>>>
>>> We were taken out by ChatGPT bots back in December; their
netblocks were a bit easier to block since they were not
as spread out. I recently saw this article about how some
people are fighting back against bots that ignore robots.txt:
https://arstechnica.com/tech-policy/2025/01/ai-haters-build-tarpits-to-trap-and-trick-ai-scrapers-that-ignore-robots-txt/
>>>
>>> Josh
>>>
>>> On Mon, Jan 27, 2025 at 6:33 PM Jeff Davis via
Evergreen-dev <[email protected]> wrote:
>>>>
>>>> Hi folks,
>>>>
>>>> Our Evergreen environment has been experiencing a
higher-than-usual volume of unwanted bot traffic in recent
months. Much of this traffic looks like webcrawlers hitting
Evergreen-specific URLs from an enormous number of different
IP addresses. Judging from discussion in IRC last week, it
sounds like other EG admins have been seeing the same thing.
Does anyone have any recommendations for managing this traffic
and mitigating its impact?
>>>>
>>>> Some solutions that have been suggested/implemented so far:
>>>> - Geoblocking entire countries.
>>>> - Using Cloudflare's proxy service. There's some
trickiness in getting this to work with Evergreen.
>>>> - Putting certain OPAC pages behind a captcha.
>>>> - Deploying publicly-available blocklists of "bad bot"
IPs/useragents/etc. (good but limited, and not EG-specific).
>>>> - Teaching EG to identify and deal with bot traffic
itself (but arguably this should happen before the traffic
hits Evergreen).
>>>>
>>>> My organization is currently evaluating CrowdSec as
another possible solution. Any opinions on any of these
approaches?
>>>> --
>>>> Jeff Davis
>>>> BC Libraries Cooperative
> --
> John Merriam
> Director of Information Technology
> Bibliomation, Inc.
> 24 Wooster Ave.
> Waterbury, CT 06708
> 203-577-4070
>
_______________________________________________
Evergreen-dev mailing list -- [email protected]
To unsubscribe send an email to [email protected]