The DorkBot queries I'm referring to look like this:

[02/Dec/2021:12:08:13 -0800] "GET /eg/opac/results?do_basket_action=Go&query=1&detail_record_view=1&search-submit-go=Search&no_highlight=1&modifier=metabib&select_basket_action=1&qtype=keyword%27%22&fg%3Amat_format=1&locg=176&sort=1 HTTP/1.0" 200 62417 "-" "UT-Dorkbot/1.0"
They vary after "metabib", but all of them use the basket feature, and they come from different library branch URLs.
-Jon

On Fri, Dec 3, 2021 at 10:45 AM JonGeorg SageLibrary <jongeorg.sagelibr...@gmail.com> wrote:
> Yeah, I'm not seeing any /opac/extras/unapi requests in the Apache logs.
> Is DorkBot used legitimately for querying the OPAC?
> -Jon
>
> On Fri, Dec 3, 2021 at 10:37 AM JonGeorg SageLibrary <jongeorg.sagelibr...@gmail.com> wrote:
>> Thank you!
>> -Jon
>>
>> On Fri, Dec 3, 2021 at 8:10 AM Blake Henderson via Evergreen-general <evergreen-general@list.evergreen-ils.org> wrote:
>>>
>>> JonGeorg,
>>>
>>> This reminds me of a similar issue that we had. We resolved it with this change to NGINX. Here's the link:
>>>
>>> https://git.evergreen-ils.org/?p=working/OpenSRF.git;a=shortlog;h=refs/heads/user/blake/LP1913610_nginx_request_limits
>>>
>>> and the bug:
>>> https://bugs.launchpad.net/evergreen/+bug/1913610
>>>
>>> I'm not sure that it's the same issue, though, as you've shared a search SQL query and this solution addresses external requests to "/opac/extras/unapi". But you might be able to apply the same nginx rate-limiting technique here if you can detect the URL they are using.
>>>
>>> There is a tool called "apachetop" which I used to see the URLs that were being hit:
>>>
>>> apt-get -y install apachetop && apachetop -f /var/log/apache2/other_vhosts_access.log
>>>
>>> and another useful command:
>>>
>>> cat /var/log/apache2/other_vhosts_access.log | awk '{print $2}' | sort | uniq -c | sort -rn
>>>
>>> You have to ignore (not limit) all the requests to the Evergreen gateway, as most of that traffic is the staff client and should (probably) not be limited.
>>>
>>> I'm just throwing some ideas out there for you. Good luck!
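[Editor's note: the rate-limiting technique Blake mentions can be sketched as below. This is not the contents of the LP#1913610 branch; the zone name, rate, and burst values here are illustrative assumptions, and the `limit_req_zone` directive must live in the `http` context.]

```nginx
# Illustrative sketch only: throttle per-client requests to the OPAC
# search endpoint. Zone name, rate, and burst are assumptions, not
# values taken from the LP#1913610 branch.
# Goes in the http {} block:
limit_req_zone $binary_remote_addr zone=opac_search:10m rate=2r/s;

# Goes in the server {} block, alongside the existing Evergreen proxy config:
location /eg/opac/results {
    limit_req zone=opac_search burst=5 nodelay;
    # ... existing proxy_pass to Apache ...
}
```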
>>>
>>> -Blake-
>>> Conducting Magic
>>> Can consume data in any format
>>> MOBIUS
>>>
>>> On 12/2/2021 9:07 PM, JonGeorg SageLibrary via Evergreen-general wrote:
>>>
>>> I tried that and still got the loopback address, after restarting services. Any other ideas? And the robots.txt file seems to be doing nothing, which is not much of a surprise. I've reached out to the people who host our network and have control of everything on the other side of the firewall.
>>> -Jon
>>>
>>> On Wed, Dec 1, 2021 at 3:57 AM Jason Stephenson <ja...@sigio.com> wrote:
>>>>
>>>> JonGeorg,
>>>>
>>>> If you're using nginx as a proxy, that may be the configuration of Apache and nginx.
>>>>
>>>> First, make sure that mod_remoteip is installed and enabled for Apache 2.
>>>>
>>>> Then, in eg_vhost.conf, find the 3 lines that begin with "RemoteIPInternalProxy 127.0.0.1/24" and uncomment them.
>>>>
>>>> Next, see what header Apache checks for the remote IP address. In my example it is "RemoteIPHeader X-Forwarded-For".
>>>>
>>>> Next, make sure that the following two lines appear in BOTH "location /" blocks in the nginx configuration:
>>>>
>>>> proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
>>>> proxy_set_header X-Forwarded-Proto $scheme;
>>>>
>>>> After reloading/restarting nginx and Apache, you should start seeing remote IP addresses in the Apache logs.
>>>>
>>>> Hope that helps!
>>>> Jason
>>>>
>>>> On 12/1/21 12:53 AM, JonGeorg SageLibrary wrote:
>>>> > Because we're behind a firewall, all the addresses display as 127.0.0.1. I can talk to the people who administer the firewall though about blocking IPs.
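[Editor's note: once real client addresses reach the Apache logs, ranking offenders by request count is a one-liner in the style of Blake's earlier command. A sketch, with sample lines standing in for the real log; in the other_vhosts_access.log format the vhost is field 1 and the client address field 2.]

```shell
# Count requests per client IP in an other_vhosts-style access log.
# The printf sample lines stand in for:
#   awk '{print $2}' /var/log/apache2/other_vhosts_access.log | ...
printf '%s\n' \
  'catalog.example.org:443 203.0.113.7 - - [01/Dec/2021:00:00:01 -0800] "GET /eg/opac/results HTTP/1.0" 200 100 "-" "UT-Dorkbot/1.0"' \
  'catalog.example.org:443 203.0.113.7 - - [01/Dec/2021:00:00:02 -0800] "GET /eg/opac/results HTTP/1.0" 200 100 "-" "UT-Dorkbot/1.0"' \
  'catalog.example.org:443 198.51.100.9 - - [01/Dec/2021:00:00:03 -0800] "GET / HTTP/1.0" 200 100 "-" "Mozilla/5.0"' \
| awk '{print $2}' | sort | uniq -c | sort -rn
```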
>>>> > Thanks
>>>> > -Jon
>>>> >
>>>> > On Tue, Nov 30, 2021 at 8:20 PM Jason Stephenson via Evergreen-general <evergreen-general@list.evergreen-ils.org> wrote:
>>>> >
>>>> > JonGeorg,
>>>> >
>>>> > Check your Apache logs for the source IP addresses. If you can't find them, I can share the correct configuration for Apache with Nginx so that you will get the addresses logged.
>>>> >
>>>> > Once you know the IP address ranges, block them. If you have a firewall, I suggest you block them there. If not, you can block them in Nginx or in your load balancer configuration, if you have one and it allows that.
>>>> >
>>>> > You may think you want your catalog to show up in search engines, but bad bots will lie about who they are. All you can do with misbehaving bots is to block them.
>>>> >
>>>> > HtH,
>>>> > Jason
>>>> >
>>>> > On 11/30/21 9:34 PM, JonGeorg SageLibrary via Evergreen-general wrote:
>>>> > > Question. We've been getting hammered by search engine bots [?], but they seem to all query our system at the same time, enough that it's crashing the app servers. We have a robots.txt file in place. I've increased the crawl delay from 3 to 10 seconds and have explicitly disallowed the specific bots, but I've seen no change from the worst offenders, Bingbot and UT-Dorkbot. We had over 4k hits from Dorkbot alone from 2pm-5pm today, and over 5k from Bingbot in the same timeframe, all a couple of hours after I made the changes to the robots file and restarted Apache services. Out of 100k entries in the vhosts files in that time frame that doesn't sound like a lot, and the rest of the traffic looks normal.
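[Editor's note: a robots.txt along the lines Jon describes (delay raised to 10 seconds, specific bots disallowed) might look like the sketch below. The user-agent tokens and paths are illustrative assumptions; note that Crawl-delay is a non-standard directive that only some crawlers honor, and, as Jason says, misbehaving bots ignore robots.txt entirely.]

```text
# robots.txt sketch - only polite crawlers obey this file
User-agent: bingbot
Crawl-delay: 10

User-agent: UT-Dorkbot
Disallow: /

User-agent: *
Crawl-delay: 10
```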
>>>> > > This issue has been happening intermittently [the last 3 occurrences were 11/30, 11/3, and 7/20] for a while, and the only thing that seems to work is to manually kill the services on the DB servers and restart services on the application servers.
>>>> > >
>>>> > > The symptom is an immediate spike in the database CPU load. I start killing all queries older than 2 minutes, but it still usually overwhelms the system, causing the app servers to stop serving requests. The stuck queries are almost always ones along the lines of:
>>>> > >
>>>> > > -- bib search: #CD_documentLength #CD_meanHarmonic #CD_uniqueWords
>>>> > > from_metarecord(*/BIB_RECORD#/*) core_limit(100000)
>>>> > > badge_orgs(1,138,151) estimation_strategy(inclusion) skip_check(0)
>>>> > > check_limit(1000) sort(1) filter_group_entry(1) 1
>>>> > > site(*/LIBRARY_BRANCH/*) depth(2)
>>>> > > WITH w AS (
>>>> > >     WITH */STRING/*_keyword_xq AS (SELECT
>>>> > >         (to_tsquery('english_nostop', COALESCE(NULLIF( '(' ||
>>>> > >             btrim(regexp_replace(split_date_range(search_normalize(replace(replace(uppercase(translate_isbn1013(E'1')), */LONG_STRING/*))),E'(?:\\s+|:)','&','g'),'&|') || ')', '()'), '')) ||
>>>> > >          to_tsquery('simple', COALESCE(NULLIF( '(' ||
>>>> > >             btrim(regexp_replace(split_date_range(search_normalize(replace(replace(uppercase(translate_isbn1013(E'1')), */LONG_STRING/*))),E'(?:\\s+|:)','&','g'),'&|') || ')', '()'), ''))) AS tsq, ...
>>>> > > [query truncated; runtime at the time: 00:02:17.319491]
>>>> > >
>>>> > > And the queries by DorkBot look like they could be starting that query, since they use the basket function in the OPAC.
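[Editor's note: the "killing all queries older than 2 minutes" step can be scripted rather than done by hand. A sketch against PostgreSQL's standard pg_stat_activity view; run it as a superuser, and add further filters (database name, user, query text) before using it in anger, since as written it terminates every matching backend.]

```sql
-- Sketch: terminate active queries running longer than 2 minutes.
SELECT pg_terminate_backend(pid)
FROM pg_stat_activity
WHERE state = 'active'
  AND now() - query_start > interval '2 minutes'
  AND pid <> pg_backend_pid();  -- never kill our own session
```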
>>>> > > "GET /eg/opac/results?do_basket_action=Go&query=1&detail_record_view=*/LONG_STRING/*&search-submit-go=Search&no_highlight=1&modifier=metabib&select_basket_action=1&qtype=keyword&fg%3Amat_format=1&locg=112&sort=1 HTTP/1.0" 500 16796 "-" "UT-Dorkbot/1.0"
>>>> > >
>>>> > > I've anonymized the output just to be cautious. Reports are run off the backup database server, so it cannot be an auto-generated report, and it doesn't happen often enough for that either. At this point I'm tempted to block the IP addresses. What strategies are you all using to deal with crawlers, and does anyone have an idea what is causing this?
>>>> > > -Jon
>>>> > >
>>>> > > _______________________________________________
>>>> > > Evergreen-general mailing list
>>>> > > Evergreen-general@list.evergreen-ils.org
>>>> > > http://list.evergreen-ils.org/cgi-bin/mailman/listinfo/evergreen-general
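[Editor's note: on blocking strategies, since UT-Dorkbot at least announces itself in the User-Agent header, one low-effort option is to refuse it at the nginx proxy before requests reach Apache/Evergreen. This sketch is not from the thread, and User-Agent strings are trivially spoofed, so firewall-level IP blocks remain the stronger tool for bots that lie about who they are.]

```nginx
# Sketch: map known-bad crawler User-Agents to a flag...
# (goes in the http {} block)
map $http_user_agent $bad_bot {
    default        0;
    ~*UT-Dorkbot   1;
}

# ...and refuse flagged requests in the server {} block:
server {
    # ... existing Evergreen proxy configuration ...
    if ($bad_bot) {
        return 403;
    }
}
```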