Thank you for sharing the link to the Dark Visitors website - it looks
very useful, indeed!
Linda
On 4/19/24 20:21, Lolis, John via Evergreen-general wrote:
There's been quite a conversation on the CODE4LIB listserv about this
lately...
Scott Prater <0000007dd2c67ad2-dmarc-requ...@lists.clir.org>
Thu, 11 Apr, 10:43 (8 days ago)
to CODE4LIB
We've also been seeing some traffic from inconsiderate AI bots.
One of my colleagues came across this site, which tracks and documents
AI bots:
https://darkvisitors.com/
-- Scott
--
Scott Prater
Digital Library Architect
UW Digital Collections Center
University of Wisconsin - Madison
________________________________________
From: Code for Libraries <code4...@lists.clir.org> on behalf of Lolis,
John <jlo...@whiteplainsny.gov>
Sent: Wednesday, April 10, 2024 12:15 PM
To: code4...@lists.clir.org
Subject: Re: [CODE4LIB] blocking GPTBot?
This *sounds* as if it should help:
https://searchengineland.com/google-extended-crawler-432636
John Lolis
Coordinator of Computer Systems
100 Martine Avenue
White Plains, NY 10601
tel: 1.914.422.1497
fax: 1.914.422.1452
https://whiteplainslibrary.org/
*“I would rather have questions that can’t be answered than answers that
can’t be questioned.”*
— Richard Feynman
<https://click.fourhourmail.com/5qure95xkf7hvvo93wh2/7qh7h8h05vr4zrtz/aHR0cHM6Ly9lbi53aWtpcGVkaWEub3JnL3dpa2kvUmljaGFyZF9GZXlubWFu>,
theoretical physicist and recipient of the Nobel Prize in Physics in 1965
On Mon, 8 Apr 2024 at 16:31, Jason Casden <cas...@gmail.com> wrote:
> Thanks for bringing this up, Eben. We've been having a horrible time with
> these bots, including those from previously fairly well-behaved sources
> like Google. They've caused issues ranging from slow response times and
> high system load all the way up to outages for some older systems. So far,
> our systems folks have been playing whack-a-mole with a combination of IP
> range blocks and increasingly detailed robots.txt statements. A group is
> being convened to investigate more comprehensive options so I will be
> watching this thread closely.
>
> Jason
>
> On Mon, Apr 8, 2024 at 4:18 PM Eben English <eben.engl...@gmail.com>
> wrote:
>
> > Hi all,
> >
> > I'm wondering if other folks are seeing AI and/or ML-related crawlers like
> > GPTBot accessing your library's website, catalog, digital collections, or
> > other sites.
> >
> > If so, are you blocking or disallowing these crawlers? Has anyone come up
> > with any policies around this?
> >
> > We're debating whether to allow these types of bots to crawl our digital
> > collections, many of which contain large amounts of copyrighted or "no
> > derivatives"-licensed materials. On one hand, these materials are available
> > for public view, but on the other hand the type of use that GPTBot and the
> > like are after (integrating the content into their models) could be
> > characterized as creating a derivative work, which is expressly
> > discouraged.
> >
> > Thanks,
> >
> > Eben English (he/him/his)
> > Digital Repository Services Manager
> > Boston Public Library
> >
>
John Lolis
On Fri, 19 Apr 2024 at 07:05, Jane Sandberg via Evergreen-general
<evergreen-general@list.evergreen-ils.org> wrote:
Hi Linda,
It's not for Evergreen, but my colleague recently blocked
claudebot using fail2ban on our load balancer
<https://github.com/pulibrary/princeton_ansible/commit/6f9009249a168442391d90e2b75028d40a8a9e91>.
Essentially, fail2ban is configured to watch Nginx's access log,
and if more than 10 claudebot requests appear within the past
minute from a particular IP, it automatically blocks all requests
from that IP for the next 24 hours. I would think that something
similar could work for Apache's access log.
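The rule described above (more than 10 matching requests from one IP within a minute leads to a 24-hour block) can be sketched in Python to make the logic concrete. This is only an illustration of the sliding-window idea, not fail2ban's configuration or the contents of the linked commit; the class name `BotRateBanner` and its methods are hypothetical, and timestamps are plain seconds supplied by the caller.

```python
from collections import defaultdict, deque

MAX_REQUESTS = 10        # more than 10 matching requests triggers a ban
WINDOW_SECONDS = 60      # counted within the past minute
BAN_SECONDS = 24 * 3600  # the ban lasts 24 hours


class BotRateBanner:
    """Hypothetical helper: bans an IP after too many matching requests."""

    def __init__(self, needle="claudebot"):
        self.needle = needle
        self.hits = defaultdict(deque)  # ip -> timestamps of recent matching requests
        self.banned_until = {}          # ip -> time at which the ban expires

    def is_banned(self, ip, now):
        return self.banned_until.get(ip, 0) > now

    def observe(self, ip, user_agent, now):
        """Record one access-log entry; return True if the IP is banned afterwards."""
        if self.is_banned(ip, now):
            return True
        if self.needle not in user_agent.lower():
            return False
        window = self.hits[ip]
        window.append(now)
        # Drop hits that have fallen out of the one-minute window.
        while window and window[0] <= now - WINDOW_SECONDS:
            window.popleft()
        if len(window) > MAX_REQUESTS:
            self.banned_until[ip] = now + BAN_SECONDS
            window.clear()
            return True
        return False
```

In a real deployment fail2ban does the log parsing and firewall updates itself; the sketch only shows why a slow, polite crawler is never banned while a burst from a single IP is.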
Good luck with the bots!
-Jane
On Fri, 19 Apr 2024 at 3:42 a.m., Linda Jansová via
Evergreen-general (evergreen-general@list.evergreen-ils.org) wrote:
Dear all,
Have any of you encountered extensive crawling by Bytespider and
Bytedance (see e.g.,
https://wordpress.org/support/topic/psa-bytedance-and-bytespider-bots-recommend-blocking/),
Claudebot or other AI bots?
If so, do you have any secret recipe for blocking the crawler from
accessing the site?
Thank you very much for sharing your experience!
Linda
_______________________________________________
Evergreen-general mailing list
Evergreen-general@list.evergreen-ils.org
http://list.evergreen-ils.org/cgi-bin/mailman/listinfo/evergreen-general