The LONG STRING sometimes contains a word, but it's usually just a string of numbers repeated, like this: 78110[78110, 7811078110), 78110]78110, 7811078110. The numbers change, which is why I suspect it's a SQL injection attempt.
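A quick way to gauge how widespread these repeated-number queries are is to grep the access logs for a 4-or-more digit number that recurs in the same request line. This is a rough sketch, not the actual pattern from the logs: the `access.log` path and the repetition threshold are assumptions, and GNU grep is assumed for back-reference support in `-E`.

```shell
# Count log lines where the same 4+ digit number appears at least
# three times -- a crude signature of the repeated-number queries.
# Adjust the filename to point at your vhost access logs.
grep -cE '([0-9]{4,}).*\1.*\1' access.log
```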
I agree re blocking by IPs. I didn't set the robots file crawl time any higher as I wanted to see what, if any, effect the initial change had during an attack.
-Jon

On Wed, Dec 1, 2021 at 11:27 AM Jeff Davis via Evergreen-general <evergreen-general@list.evergreen-ils.org> wrote:
> Our robots.txt file (https://catalogue.libraries.coop/robots.txt)
> throttles Googlebot and Bingbot to 60 seconds and disallows certain
> other crawlers entirely. So even 10 seconds seems generous to me.
>
> Of course, robots.txt will only be respected by well-behaved crawlers;
> there's nothing preventing a bot from ignoring it (in which case, as
> Jason says, your best bet may be to block the offending IP).
>
> Is the "LONG_STRING" in your examples a legitimate search -- i.e., no
> unusual characters or obvious SQL injection attempts? Does it contain
> complex nesting of search terms?
>
> Jeff
>
> On 2021-11-30 6:34 p.m., JonGeorg SageLibrary via Evergreen-general wrote:
> > Question. We've been getting hammered by search engine bots, but
> > they seem to all query our system at the same time -- enough that
> > it's crashing the app servers. We have a robots.txt file in place.
> > I've increased the crawl delay from 3 to 10 seconds and have
> > explicitly disallowed the specific bots, but I've seen no change
> > from the worst offenders, Bingbot and UT-Dorkbot. We had over 4k
> > hits from Dorkbot alone from 2pm-5pm today, and over 5k from Bingbot
> > in the same timeframe, all a couple of hours after I made the
> > changes to the robots file and restarted Apache services. Out of
> > 100k entries in the vhosts files in that time frame that doesn't
> > sound like a lot, but the rest of the traffic looks normal. This
> > issue has been happening intermittently [the last three were 11/30,
> > 11/3, 7/20] for a while, and the only thing that seems to work is to
> > manually kill the services on the DB servers and restart services on
> > the application servers.
> >
> > The symptom is an immediate spike in the database CPU load. I start
> > killing all queries older than 2 minutes, but it still usually
> > overwhelms the system, causing the app servers to stop serving
> > requests. The stuck queries are almost always ones along the lines of:
> >
> > -- bib search: #CD_documentLength #CD_meanHarmonic #CD_uniqueWords
> > from_metarecord(*/BIB_RECORD#/*) core_limit(100000)
> > badge_orgs(1,138,151) estimation_strategy(inclusion) skip_check(0)
> > check_limit(1000) sort(1) filter_group_entry(1) 1
> > site(*/LIBRARY_BRANCH/*) depth(2)
> > WITH w AS (
> >   WITH */STRING/*_keyword_xq AS (SELECT
> >     (to_tsquery('english_nostop', COALESCE(NULLIF( '(' ||
> >       btrim(regexp_replace(split_date_range(search_normalize(replace(replace(uppercase(translate_isbn1013(E'1')),
> >       */LONG_STRING/*))),E'(?:\\s+|:)','&','g'),'&|') || ')', '()'), '')) ||
> >     to_tsquery('simple', COALESCE(NULLIF( '(' ||
> >       btrim(regexp_replace(split_date_range(search_normalize(replace(replace(uppercase(translate_isbn1013(E'1')),
> >       */LONG_STRING/*))),E'(?:\\s+|:)','&','g'),'&|') || ')', '()'), ''))) AS tsq,
> >     (to_tsquery('english_nostop', COALESCE(NULLIF( '(' ||
> >       btrim(regexp_replace(split_date_range(search_normalize
> > 00:02:17.319491 | */STRING/* |
> >
> > And the queries by Dorkbot look like they could be starting the
> > query, since it's using the basket function in the OPAC:
> >
> > "GET /eg/opac/results?do_basket_action=Go&query=1&detail_record_view=*/LONG_STRING/*&search-submit-go=Search&no_highlight=1&modifier=metabib&select_basket_action=1&qtype=keyword&fg%3Amat_format=1&locg=112&sort=1 HTTP/1.0" 500 16796 "-" "UT-Dorkbot/1.0"
> >
> > I've anonymized the output just to be cautious. Reports are run off
> > the backup database server, so it cannot be an auto-generated
> > report, and it doesn't happen often enough for that either. At this
> > point I'm tempted to block the IP addresses.
> > What strategies are you all using to deal with crawlers, and does
> > anyone have an idea what is causing this?
> > -Jon
> >
> > _______________________________________________
> > Evergreen-general mailing list
> > Evergreen-general@list.evergreen-ils.org
> > http://list.evergreen-ils.org/cgi-bin/mailman/listinfo/evergreen-general
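For reference, the kind of robots.txt rules discussed in this thread (a 60-second throttle for the big crawlers, a blanket disallow for the abusive ones, a default delay for the rest) look something like the sketch below. The values are illustrative, not the actual catalogue.libraries.coop file, and as Jeff notes above, a bot like UT-Dorkbot is free to ignore the file entirely.

```
# Throttle well-behaved crawlers. Bingbot honors Crawl-delay;
# Googlebot ignores it (its rate is set via Search Console).
User-agent: bingbot
Crawl-delay: 60

# Disallow abusive crawlers outright.
User-agent: UT-Dorkbot
Disallow: /

# Default delay for everyone else.
User-agent: *
Crawl-delay: 10
```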
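Before blocking, it helps to confirm which clients are actually responsible for the load. A standard tally of a combined-format access log by client IP (field 1) surfaces the worst offenders; the log path is an assumption.

```shell
# Top client IPs by request count, busiest first.
# Assumes access.log is in Apache common/combined format,
# where field 1 is the client IP.
awk '{print $1}' access.log | sort | uniq -c | sort -rn | head
```

The same pipeline with the user-agent field swapped in for `$1` separates Bingbot traffic from Dorkbot traffic before any firewall or Apache deny rules go in.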