There's been quite a conversation on the CODE4LIB listserv about this lately...
Scott Prater <0000007dd2c67ad2-dmarc-requ...@lists.clir.org>
Thu, 11 Apr, 10:43
to CODE4LIB

We've also been seeing some traffic from inconsiderate AI bots. One of my colleagues came across this site, which tracks and documents AI bots:

https://darkvisitors.com/

-- Scott

--
Scott Prater
Digital Library Architect
UW Digital Collections Center
University of Wisconsin - Madison

________________________________________
From: Code for Libraries <code4...@lists.clir.org> on behalf of Lolis, John <jlo...@whiteplainsny.gov>
Sent: Wednesday, April 10, 2024 12:15 PM
To: code4...@lists.clir.org
Subject: Re: [CODE4LIB] blocking GPTBot?

This *sounds* as if it should help:

https://searchengineland.com/google-extended-crawler-432636

John Lolis
Coordinator of Computer Systems

100 Martine Avenue
White Plains, NY 10601

tel: 1.914.422.1497
fax: 1.914.422.1452

https://whiteplainslibrary.org/

*“I would rather have questions that can’t be answered than answers that can’t be questioned.”*
— Richard Feynman <https://click.fourhourmail.com/5qure95xkf7hvvo93wh2/7qh7h8h05vr4zrtz/aHR0cHM6Ly9lbi53aWtpcGVkaWEub3JnL3dpa2kvUmljaGFyZF9GZXlubWFu>, theoretical physicist and recipient of the Nobel Prize in Physics in 1965

On Mon, 8 Apr 2024 at 16:31, Jason Casden <cas...@gmail.com> wrote:

> Thanks for bringing this up, Eben. We've been having a horrible time with
> these bots, including those from previously fairly well-behaved sources
> like Google. They've caused issues ranging from slow response times and
> high system load all the way up to outages for some older systems. So far,
> our systems folks have been playing whack-a-mole with a combination of IP
> range blocks and increasingly detailed robots.txt statements. A group is
> being convened to investigate more comprehensive options so I will be
> watching this thread closely.
>
> Jason
>
> On Mon, Apr 8, 2024 at 4:18 PM Eben English <eben.engl...@gmail.com> wrote:
>
> > Hi all,
> >
> > I'm wondering if other folks are seeing AI and/or ML-related crawlers
> > like GPTBot accessing your library's website, catalog, digital
> > collections, or other sites.
> >
> > If so, are you blocking or disallowing these crawlers? Has anyone come up
> > with any policies around this?
> >
> > We're debating whether to allow these types of bots to crawl our digital
> > collections, many of which contain large amounts of copyrighted or "no
> > derivatives"-licensed materials. On one hand, these materials are
> > available for public view, but on the other hand the type of use that
> > GPTBot and the like are after (integrating the content into their models)
> > could be characterized as creating a derivative work, which is expressly
> > discouraged.
> >
> > Thanks,
> >
> > Eben English (he/him/his)
> > Digital Repository Services Manager
> > Boston Public Library

John Lolis
Coordinator of Computer Systems

100 Martine Avenue
White Plains, NY 10601

tel: 1.914.422.1497
fax: 1.914.422.1452

https://whiteplainslibrary.org/

*“I would rather have questions that can’t be answered than answers that can’t be questioned.”*
— Richard Feynman <https://click.fourhourmail.com/5qure95xkf7hvvo93wh2/7qh7h8h05vr4zrtz/aHR0cHM6Ly9lbi53aWtpcGVkaWEub3JnL3dpa2kvUmljaGFyZF9GZXlubWFu>, theoretical physicist and recipient of the Nobel Prize in Physics in 1965

On Fri, 19 Apr 2024 at 07:05, Jane Sandberg via Evergreen-general <evergreen-general@list.evergreen-ils.org> wrote:

> Hi Linda,
>
> It's not for Evergreen, but my colleague recently blocked claudebot using
> fail2ban on our load balancer
> <https://github.com/pulibrary/princeton_ansible/commit/6f9009249a168442391d90e2b75028d40a8a9e91>.
> Essentially, fail2ban is configured to watch Nginx's access log, and if
> more than 10 claudebot requests appear within the past minute from a
> particular IP, it automatically blocks all requests from that IP for the
> next 24 hours. I would think that something similar could work for
> Apache's access log.
>
> Good luck with the bots!
>
> -Jane
>
> On Fri, 19 Apr 2024 at 3:42 a.m., Linda Jansová via Evergreen-general
> (evergreen-general@list.evergreen-ils.org) wrote:
>
>> Dear all,
>>
>> Have any of you encountered extensive crawling by Bytespider and
>> Bytedance (see e.g.
>> https://wordpress.org/support/topic/psa-bytedance-and-bytespider-bots-recommend-blocking/),
>> Claudebot, or other AI bots?
>>
>> If so, do you have any secret recipe for keeping the crawlers from
>> accessing the site?
>>
>> Thank you very much for sharing your experience!
>>
>> Linda
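For anyone who wants to see the logic of the fail2ban rule Jane describes before touching a real configuration, here is a rough Python sketch of the same idea: count matching requests per IP over the past minute and flag the IP for a 24-hour block once it goes over 10. This is not fail2ban itself and not the Princeton configuration linked above. The class name BotBanTracker, the regular expression (which assumes nginx's default "combined" log format), and the constant names are all made up for illustration.

```python
import re
from collections import defaultdict, deque
from datetime import datetime, timedelta

# Assumption: nginx's default "combined" access-log format.
# Adjust the pattern if your log_format differs.
LOG_LINE = re.compile(
    r'^(?P<ip>\S+) \S+ \S+ \[(?P<ts>[^\]]+)\] "[^"]*" \d{3} \S+ '
    r'"[^"]*" "(?P<agent>[^"]*)"'
)

MAX_HITS = 10                     # more than this many hits triggers a ban
WINDOW = timedelta(minutes=1)     # look-back period for counting hits
BAN_TIME = timedelta(hours=24)    # how long a flagged IP stays blocked


class BotBanTracker:
    """Hypothetical helper that mimics the rate rule described in the thread."""

    def __init__(self, agent_substring="claudebot"):
        self.agent_substring = agent_substring.lower()
        self.hits = defaultdict(deque)   # ip -> timestamps of recent hits
        self.banned_until = {}           # ip -> time the ban expires

    def is_banned(self, ip, now):
        until = self.banned_until.get(ip)
        return until is not None and now < until

    def observe(self, line):
        """Feed one log line; return the IP if this line triggers a new ban."""
        m = LOG_LINE.match(line)
        if m is None or self.agent_substring not in m["agent"].lower():
            return None
        ip = m["ip"]
        now = datetime.strptime(m["ts"], "%d/%b/%Y:%H:%M:%S %z")
        if self.is_banned(ip, now):
            return None
        window = self.hits[ip]
        window.append(now)
        # Drop hits that fell out of the one-minute window.
        while window and now - window[0] > WINDOW:
            window.popleft()
        if len(window) > MAX_HITS:
            self.banned_until[ip] = now + BAN_TIME
            window.clear()
            return ip
        return None
```

To try it, feed each line of an access log to observe() and act on any IP it returns. In practice fail2ban handles the actual blocking (firewall rules, unbanning after the timeout), so this sketch is only useful for testing thresholds against an existing log before committing to them.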
_______________________________________________ Evergreen-general mailing list Evergreen-general@list.evergreen-ils.org http://list.evergreen-ils.org/cgi-bin/mailman/listinfo/evergreen-general