I work in a library, so yes, we have a similar problem: our Solr is used indirectly
by a web application running on another server.

We use https://wiki.archlinux.org/title/fail2ban to block IPs which exceed a
given number of requests per minute.
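
For reference, a minimal fail2ban setup along those lines could look like the
sketch below. The filter name, log path, and regex are assumptions -- adapt
them to wherever your Jetty/Solr request log lives and to its actual line
format:

```
# /etc/fail2ban/filter.d/solr-flood.conf  (hypothetical filter name)
[Definition]
# Match any request line from <HOST> against the Solr endpoint
failregex = ^<HOST> .* "GET /solr/

# /etc/fail2ban/jail.local
[solr-flood]
enabled  = true
filter   = solr-flood
logpath  = /var/solr/logs/request.log
# ban an IP that makes more than 120 requests within 60 seconds
maxretry = 120
findtime = 60
bantime  = 3600
port     = http,https
```

The thresholds are illustrative; tune maxretry/findtime to whatever request
rate your legitimate clients actually generate.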
________________________________
From: Dmitri Maziuk <dmitri.maz...@gmail.com>
Sent: Thursday, 20 June 2024 17:38:27
To: users@solr.apache.org
Subject: are bots DoS'ing anyone else's Solr?

Hi all,

the latest mole in the eternal whack-a-mole game with web crawlers
(GPTBot) DoS'ed our Solr again & I took a closer look at the logs.
Here's what it looks like is happening:

- the bot is hitting a URL backed by Solr search and starts following
all permutations of facets and "next page"s at a rate of 60+ hits/second.
- Solr is not returning the results fast enough and the bot is dropping
connections.
- An INFO message is logged: jetty is "unable to write response, client
closed connection or we are shutting down" -- IOException on the
OutputStream: Closed.
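
The flood pattern described above can be spotted in the request log with a
short script. This is a sketch that assumes an NCSA-style access-log line;
the regex will need adjusting to your actual Jetty/Solr request-log format:

```python
import re
from collections import Counter

# Assumed NCSA-style line: capture the client IP and the timestamp
# truncated to the minute, e.g. "20/Jun/2024:17:38".
LINE_RE = re.compile(r'^(\S+) \S+ \S+ \[(\d{2}/\w{3}/\d{4}:\d{2}:\d{2}):\d{2}')

def hits_per_minute(lines):
    """Return a Counter keyed by (ip, 'dd/Mon/yyyy:HH:MM')."""
    counts = Counter()
    for line in lines:
        m = LINE_RE.match(line)
        if m:
            counts[(m.group(1), m.group(2))] += 1
    return counts

# Synthetic example lines (hypothetical IPs and paths):
log = [
    '203.0.113.9 - - [20/Jun/2024:17:38:01 +0000] "GET /solr/select?q=* HTTP/1.1" 200 512',
    '203.0.113.9 - - [20/Jun/2024:17:38:02 +0000] "GET /solr/select?q=*&start=10 HTTP/1.1" 200 512',
    '198.51.100.7 - - [20/Jun/2024:17:38:05 +0000] "GET /solr/select?q=books HTTP/1.1" 200 512',
]
rates = hits_per_minute(log)
# 203.0.113.9 made 2 requests in minute 17:38
```

Any (ip, minute) pair whose count climbs into the thousands is a candidate
for blocking.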

These go on for a while until:

java.nio.file.FileSystemException:
$PATH_TO\server\solr\preview_shard1_replica_n2\data\tlog\buffer.tlog.0000800034318988100:
The process cannot access the file because it is being used by another
process.
  -- Different file suffix # on every one of those

And eventually an update comes in and fails with

ERROR (qtp173791568-23140) [c:preview s:shard1 r:core_node4
x:preview_shard1_replica_n2] o.a.s.h.RequestHandlerBase
org.apache.solr.common.SolrException: Error logging add =>
org.apache.solr.common.SolrException: Error logging add
      at
org.apache.solr.update.TransactionLog.write(TransactionLog.java:420)
org.apache.solr.common.SolrException: Error logging add

Caused by: java.io.IOException: There is not enough space on the disk
...

At this point Solr is hosed. Admin page shows "no collections available"
but does respond to queries; all queries from the website client (.NET)
are failing.

This is Solr 8.11.2 on Windows Server 2022 / Corretto JVM 11.

So, questions: has anyone else seen this?

Who is "buffer.tlog.xyz", do they have a size/# files cap, and are they
not getting GC'ed fast enough under this kind of load?

The 400 GB disk is normally ~90% free, so "not enough space on the disk"
does not sound right. The logs do pile up when this happens and Java
starts dumping gigabytes of stack traces, but they add up to a few hundred
MB at most. There certainly was *some* free space when I got to it, and
it's back to 99% free after a Solr restart.

Any suggestions as to how to deal with this?

(Obviously, I added "Disallow: /" to robots.txt for GPTBot, but that's
only good until the next bot comes along.)
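
For anyone doing the same, the robots.txt stanza is just:

```
# robots.txt -- blocks GPTBot only; other crawlers are unaffected
User-agent: GPTBot
Disallow: /
```

As noted, this only helps against crawlers that actually honor robots.txt,
and each new bot needs its own User-agent entry.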

TIA
Dima
