Solr allows you to go to page=1000 or whatever, bots will follow it, but there is rarely any business value in going that deep.
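The "stop showing results past page 5-10" part can live entirely in the web tier, before any request reaches Solr. A minimal sketch in Python (the function and constant names here are mine, not from any Solr client library):

```python
# Minimal sketch: refuse to translate deep page requests into Solr queries.
# MAX_PAGE, PAGE_SIZE and page_to_start are illustrative, not any Solr API.
MAX_PAGE = 10    # pages beyond this get no results (or a 4xx in the web tier)
PAGE_SIZE = 20

def page_to_start(page: int, max_page: int = MAX_PAGE):
    """Return the Solr 'start' offset for a 1-based page, or None if too deep."""
    if page < 1 or page > max_page:
        return None  # caller should 404/redirect instead of querying Solr
    return (page - 1) * PAGE_SIZE
```

A bot asking for page=1000 then never generates a Solr query at all, which is the point: the expensive deep-paging work is refused up front rather than rate-limited after the fact.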
You can come up with a scheme for cursorMarks + caching (faster than deep paging), or just stop showing results past page 5-10.

On Thu, Jun 20, 2024 at 11:39 AM Dmitri Maziuk <dmitri.maz...@gmail.com> wrote:
>
> Hi all,
>
> the latest mole in the eternal whack-a-mole game with web crawlers
> (GPTBot) DoS'ed our Solr again & I took a closer look at the logs.
> Here's what it looks like is happening:
>
> - the bot is hitting a URL backed by Solr search and starts following
> all permutations of facets and "next page"s at a rate of 60+ hits/second.
> - Solr is not returning the results fast enough and the bot is dropping
> connections.
> - An INFO message is logged: jetty is "unable to write response, client
> closed connection or we are shutting down" -- IOException on the
> OutputStream: Closed.
>
> These go on for a while until:
>
> java.nio.file.FileSystemException:
> $PATH_TO\server\solr\preview_shard1_replica_n2\data\tlog\buffer.tlog.0000800034318988100:
> The process cannot access the file because it is being used by another
> process.
> -- Different file suffix # on every one of those
>
> And eventually an update comes in and fails with
>
> ERROR (qtp173791568-23140) [c:preview s:shard1 r:core_node4
> x:preview_shard1_replica_n2] o.a.s.h.RequestHandlerBase
> org.apache.solr.common.SolrException: Error logging add =>
> org.apache.solr.common.SolrException: Error logging add
> at org.apache.solr.update.TransactionLog.write(TransactionLog.java:420)
> org.apache.solr.common.SolrException: Error logging add
>
> Caused by: java.io.IOException: There is not enough space on the disk
> ...
>
> At this point Solr is hosed. Admin page shows "no collections available"
> but does respond to queries; all queries from the website client (.NET)
> are failing.
>
> This is Solr 8.11.2 on winders server 2022 / Corretto JVM 11.
>
> So, questions: has anyone else seen this?
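For the cursorMark side of that suggestion: the contract documented in the Solr Ref Guide is that the sort must include the uniqueKey field as a tiebreaker (e.g. "score desc, id asc"), the first request sends cursorMark=*, and iteration ends when nextCursorMark comes back equal to the cursorMark you sent. A sketch, with fetch_page standing in for whatever real HTTP or SolrJ call you use:

```python
# Sketch of the Solr cursorMark iteration contract. fetch_page is a stand-in
# for your real request; it takes a cursorMark value and returns
# (docs, nextCursorMark) from Solr's response.
def iterate_all(fetch_page):
    """Yield every doc; fetch_page(cursor_mark) -> (docs, next_cursor_mark)."""
    cursor = "*"  # first request always sends cursorMark=*
    while True:
        docs, next_cursor = fetch_page(cursor)
        yield from docs
        if next_cursor == cursor:  # unchanged mark == end of results
            return
        cursor = next_cursor
```

The caching half is then straightforward: a (q, sort, cursorMark) tuple makes a reasonable cache key, so repeated bot traffic walking the same result set hits your cache instead of re-running the query.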
>
> Who is "buffer.tlog.xyz", do they have a size/# files cap, and are they
> not getting GC'ed fast enough under this kind of load?
>
> The 400GB disk is normally at ~90% empty, "not enough space on the disk"
> does not sound right. The logs do pile up when this happens and Java
> starts dumping gigabytes of stack traces, but they add up to a few 100 MBs
> at most. There certainly was *some* free space when I got to it, and
> it's back to 99% free after Solr restart.
>
> Any suggestions as to how to deal with this?
>
> (Obviously, I added "Disallow: /" to robots.txt for GPTBot, but that's
> only good until the next bot comes along.)
>
> TIA
> Dima
>
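On the robots.txt point: blocking bots one at a time is indeed whack-a-mole. One variant is to also disallow the expensive search/facet URLs for all user agents, so the next well-behaved crawler never finds them. A sketch (the /search path is illustrative; substitute whatever prefix fronts your Solr-backed pages):

```text
User-agent: GPTBot
Disallow: /

# Illustrative: keep all compliant crawlers out of facet/paging permutations
User-agent: *
Disallow: /search
```

This only helps against crawlers that honor robots.txt; anything that ignores it still needs rate limiting or the page cap discussed above.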