A robots.txt file can help manage many spiders, and it can include a link to
the DSpace sitemap:

Sitemap: /jspui/sitemap 

The robots.txt file can include
Crawl-delay: 10 

and it is useful to disallow the search and browse links, e.g.

Disallow: /jspui/simple-search

Many robots get lost circling around the DSpace search results.
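
Putting those pieces together, a robots.txt for a JSPUI site might look
roughly like the following. This is only a sketch: the hostname is a
placeholder, the Sitemap line is normally given as an absolute URL, and the
exact Disallow paths depend on which UI pages you want kept out of the crawl.

User-agent: *
# Ask well-behaved robots to wait 10 seconds between requests
Crawl-delay: 10
# Keep crawlers out of the search and browse pages, which loop endlessly
Disallow: /jspui/simple-search
Disallow: /jspui/browse

# Point crawlers at the DSpace sitemap instead
Sitemap: https://repository.example.edu/jspui/sitemap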

We use fail2ban mainly to detect malicious activity such as high-rate hits on
login endpoints, but it can also be used to detect inordinate crawler
activity. The Crawl-delay directive is honoured by many of the robots.
We tend to be aggressive in using fail2ban to block access to invalid and
maliciously crafted URLs.
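
As a rough sketch of the fail2ban side: the filter name, log path, regex and
thresholds below are assumptions to be tuned for your own Apache layout and
traffic, not a recommended configuration.

# /etc/fail2ban/filter.d/dspace-search.conf  (hypothetical filter name)
[Definition]
# Count hits on the JSPUI search pages in the Apache access log
failregex = ^<HOST> .* "(GET|POST) /jspui/simple-search

# /etc/fail2ban/jail.local
[dspace-search]
enabled  = true
port     = http,https
filter   = dspace-search
logpath  = /var/log/apache2/access.log
# ban an address that makes more than 300 search requests in 10 minutes
maxretry = 300
findtime = 600
bantime  = 3600

A client honouring the Crawl-delay above will never get near that threshold,
while a runaway crawler hammering the search pages is banned for an hour.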



Edmund Balnaves
Prosentient Systems
https://www.prosentient.com.au
On Friday, January 20, 2023 at 12:27:24 AM UTC+11 Mark H. Wood wrote:

> On Thu, Jan 19, 2023 at 11:50:03AM +0100, Florian Wille wrote:
> > my DSpace (6.3) site usually gets around 10k/h requests. This is handled 
> > quite well. But sometimes there are multiple 
> > bots/crawlers/spiders/indexers/harvesters/whatevers each throwing up to 
> > 15k/h requests at me at the same time, on top of my 10k/h 
> > standard traffic. This my DSpace cannot handle, and it becomes 
> > unresponsive, making the site seem offline to my users.
> > I performance-tuned my Apache and Postgres to handle more 
> > requests/connections and gave the system plenty of RAM/CPU, but DSpace 
> > gives up; I think it's the Hibernate layer breaking down.
> > 
> > I was thinking of using fail2ban to put a lid on excessive requesting. 
> > Does anyone have experience with that, or are there some best-practice 
> > guides for fail2ban with DSpace? I don't want to block/drop legitimate 
> > harvesters/indexers...
> > 
> > Also I came across mod_apache_rate_limit. Would that do any good for my 
> > case?
>
> Well, do you want to ban the spiders, or just slow them to a
> reasonable rate? If it were my site, unless I could identify some
> genuinely abusive clients, I'd go with rate limiting. There might be
> a case for banning some clients and slowing others.
>
> I'd probably choose something made for rate limiting, if I went that
> route, rather than pressing fail2ban into this sort of service. I do
> see that a number of others have used fail2ban in this way.
>
> But I haven't yet made the time to explore these options in depth.
> What we do here is to keep an eye on response time with 'monit'. If
> monit thinks DSpace is sick or has died, it kills and restarts
> Tomcat. That is kind of drastic but it does shed an excessive load.
>
> -- 
> Mark H. Wood
> Lead Technology Analyst
>
> University Library
> Indiana University - Purdue University Indianapolis
> 755 W. Michigan Street
> Indianapolis, IN 46202
> 317-274-0749
> www.ulib.iupui.edu
>
