Solr allows you to go to page=1000 or whatever, and bots will follow
the links, but there is rarely any business value in going that deep.
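A minimal sketch of clamping page depth in the web tier, so "next page" links simply stop before the `start` offset gets expensive. The handler and constants here are hypothetical, not from any real app:

```python
# Hypothetical web-tier helper that builds Solr query parameters.
# Clamping the page number means a bot requesting page=1000 gets the
# same cheap query as page=10 instead of start=19980.
MAX_PAGE = 10
ROWS_PER_PAGE = 20

def solr_params(query: str, page: int) -> dict:
    # Clamp to [1, MAX_PAGE] so deep-paging requests stay bounded.
    page = max(1, min(page, MAX_PAGE))
    return {
        "q": query,
        "rows": ROWS_PER_PAGE,
        "start": (page - 1) * ROWS_PER_PAGE,
    }
```

The `start` parameter never exceeds `(MAX_PAGE - 1) * ROWS_PER_PAGE`, so the worst a crawler can do is re-request cheap pages.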

You can come up with a scheme using cursorMark plus caching (much
faster than deep paging) or just stop showing results past page 5-10.
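For the cursorMark route, the shape of the loop looks roughly like this. This is a sketch, not a drop-in client: it assumes the collection's uniqueKey is `id`, and the `fetch` callable stands in for whatever HTTP client you use against `/select` with `wt=json`. The Solr-side contract is real: the sort must include the uniqueKey as a tiebreaker, you start with `cursorMark=*`, and you are done when `nextCursorMark` comes back unchanged:

```python
# Sketch of cursorMark-based pagination against Solr.
# `fetch` is a stand-in for an HTTP call returning the parsed JSON
# response; the uniqueKey field name ("id") is an assumption.
def iterate_with_cursor(fetch, q="*:*", rows=100):
    # cursorMark requires a total ordering: sort must end with the
    # uniqueKey as a tiebreaker.
    params = {"q": q, "rows": rows, "sort": "id asc", "cursorMark": "*"}
    while True:
        resp = fetch(dict(params))
        for doc in resp["response"]["docs"]:
            yield doc
        next_mark = resp["nextCursorMark"]
        if next_mark == params["cursorMark"]:
            break  # cursor unchanged => result set exhausted
        params["cursorMark"] = next_mark
```

Because each cursorMark request is O(rows) on the server rather than O(start + rows), the per-request cost stays flat no matter how deep a crawler walks, and identical cursor values cache cleanly.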

On Thu, Jun 20, 2024 at 11:39 AM Dmitri Maziuk <dmitri.maz...@gmail.com> wrote:
>
> Hi all,
>
> the latest mole in the eternal whack-a-mole game with web crawlers
> (GPTBot) DoS'ed our Solr again & I took a closer look at the logs.
> Here's what it looks like is happening:
>
> - the bot is hitting a URL backed by Solr search and starts following
> all permutations of facets and "next page"s at a rate of 60+ hits/second.
> - Solr is not returning the results fast enough and the bot is dropping
> connections.
> - An INFO message is logged: jetty is "unable to write response, client
> closed connection or we are shutting down" -- IOException on the
> OutputStream: Closed.
>
> These go on for a while until:
>
> java.nio.file.FileSystemException:
> $PATH_TO\server\solr\preview_shard1_replica_n2\data\tlog\buffer.tlog.0000800034318988100:
> The process cannot access the file because it is being used by another
> process.
>   -- Different file suffix # on every one of those
>
> And eventually an update comes in and fails with
>
> ERROR (qtp173791568-23140) [c:preview s:shard1 r:core_node4
> x:preview_shard1_replica_n2] o.a.s.h.RequestHandlerBase
> org.apache.solr.common.SolrException: Error logging add =>
> org.apache.solr.common.SolrException: Error logging add
> at
> org.apache.solr.update.TransactionLog.write(TransactionLog.java:420)
> org.apache.solr.common.SolrException: Error logging add
>
> Caused by: java.io.IOException: There is not enough space on the disk
> ...
>
> At this point Solr is hosed. Admin page shows "no collections available"
> but does respond to queries; all queries from the website client (.NET)
> are failing.
>
> This is Solr 8.11.2 on Windows Server 2022 / Corretto JVM 11.
>
> So, questions: has anyone else seen this?
>
> Who is "buffer.tlog.xyz", do they have a size/# files cap, and are they
> not getting GC'ed fast enough under this kind of load?
>
> The 400GB disk is normally at ~90% empty, "not enough space on the disk"
> does not sound right. The logs do pile up when this happens and Java
> starts dumping gigabytes of stack traces, but they add up to a few 100 MBs
> at most.  There certainly was *some* free space when I got to it, and
> it's back to 99% free after Solr restart.
>
> Any suggestions as to how to deal with this?
>
> (Obviously, I added "Disallow: /" to robots.txt for GPTBot, but that's
> only good until the next bot comes along.)
>
> TIA
> Dima
>
