Thank you!
-Jon

On Fri, Dec 3, 2021 at 8:10 AM Blake Henderson via Evergreen-general
<evergreen-general@list.evergreen-ils.org> wrote:
> JonGeorg,
>
> This reminds me of a similar issue that we had. We resolved it with this
> change to NGINX. Here's the link:
>
> https://git.evergreen-ils.org/?p=working/OpenSRF.git;a=shortlog;h=refs/heads/user/blake/LP1913610_nginx_request_limits
>
> and the bug:
> https://bugs.launchpad.net/evergreen/+bug/1913610
>
> I'm not sure that it's the same issue, though, as you've shared a search
> SQL query and this solution addresses external requests to
> "/opac/extras/unapi". But you might be able to apply the same nginx rate
> limiting technique here if you can detect the URL they are using.
>
> There is a tool called "apachetop" which I used to see the URLs that
> were being used:
>
> apt-get -y install apachetop && apachetop -f /var/log/apache2/other_vhosts_access.log
>
> and another useful command:
>
> cat /var/log/apache2/other_vhosts_access.log | awk '{print $2}' | sort | uniq -c | sort -rn
>
> You have to ignore (not limit) all the requests to the Evergreen
> gateway, as most of that traffic is the staff client and should
> (probably) not be limited.
>
> I'm just throwing some ideas out there for you. Good luck!
>
> -Blake-
> Conducting Magic
> Can consume data in any format
> MOBIUS
>
> On 12/2/2021 9:07 PM, JonGeorg SageLibrary via Evergreen-general wrote:
> > I tried that and still got the loopback address, after restarting
> > services. Any other ideas? And the robots.txt file seems to be doing
> > nothing, which is not much of a surprise. I've reached out to the
> > people who host our network and have control of everything on the
> > other side of the firewall.
> > -Jon
> >
> > On Wed, Dec 1, 2021 at 3:57 AM Jason Stephenson <ja...@sigio.com> wrote:
> >> JonGeorg,
> >>
> >> If you're using nginx as a proxy, that may be the configuration of
> >> Apache and nginx.
> >>
> >> First, make sure that mod_remoteip is installed and enabled for Apache 2.
>> Then, in eg_vhost.conf, find the 3 lines that begin with
>> "RemoteIPInternalProxy 127.0.0.1/24" and uncomment them.
>>
>> Next, see what header Apache checks for the remote IP address. In my
>> example it is "RemoteIPHeader X-Forwarded-For".
>>
>> Next, make sure that the following two lines appear in BOTH "location /"
>> blocks in the nginx configuration:
>>
>> proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
>> proxy_set_header X-Forwarded-Proto $scheme;
>>
>> After reloading/restarting nginx and Apache, you should start seeing
>> remote IP addresses in the Apache logs.
>>
>> Hope that helps!
>> Jason
>>
>> On 12/1/21 12:53 AM, JonGeorg SageLibrary wrote:
>> > Because we're behind a firewall, all the addresses display as
>> > 127.0.0.1. I can talk to the people who administer the firewall,
>> > though, about blocking IPs. Thanks
>> > -Jon
>> >
>> > On Tue, Nov 30, 2021 at 8:20 PM Jason Stephenson via Evergreen-general
>> > <evergreen-general@list.evergreen-ils.org> wrote:
>> >
>> > JonGeorg,
>> >
>> > Check your Apache logs for the source IP addresses. If you can't find
>> > them, I can share the correct configuration for Apache with Nginx so
>> > that you will get the addresses logged.
>> >
>> > Once you know the IP address ranges, block them. If you have a
>> > firewall, I suggest you block them there. If not, you can block them
>> > in Nginx or in your load balancer configuration, if you have one and
>> > it allows that.
>> >
>> > You may think you want your catalog to show up in search engines, but
>> > bad bots will lie about who they are. All you can do with misbehaving
>> > bots is to block them.
>> >
>> > HtH,
>> > Jason
>> >
>> > On 11/30/21 9:34 PM, JonGeorg SageLibrary via Evergreen-general wrote:
>> > > Question. We've been getting hammered by search engine bots [?],
>> > > but they seem to all query our system at the same time.
>> > > Enough that it's crashing the app servers. We have a robots.txt
>> > > file in place. I've increased the crawl delay from 3 to 10 seconds,
>> > > and have explicitly disallowed the specific bots, but I've seen no
>> > > change from the worst offenders - Bingbot and UT-Dorkbot. We had
>> > > over 4k hits from Dorkbot alone from 2pm-5pm today, and over 5k from
>> > > Bingbot in the same timeframe, all a couple of hours after I made
>> > > the changes to the robots file and restarted apache services. Out of
>> > > 100k entries in the vhosts files in that time frame that doesn't
>> > > sound like a lot, but the rest of the traffic looks normal. This
>> > > issue has been happening intermittently [the last 3 occurrences were
>> > > 11/30, 11/3, and 7/20] for a while, and the only thing that seems to
>> > > work is to manually kill the services on the DB servers and restart
>> > > services on the application servers.
>> > >
>> > > The symptom is an immediate spike in the database CPU load. I start
>> > > killing all queries older than 2 minutes, but it still usually
>> > > overwhelms the system, causing the app servers to stop serving
>> > > requests.
>> > > The stuck queries are almost always ones along the lines of:
>> > >
>> > > -- bib search: #CD_documentLength #CD_meanHarmonic #CD_uniqueWords
>> > > from_metarecord(*/BIB_RECORD#/*) core_limit(100000)
>> > > badge_orgs(1,138,151) estimation_strategy(inclusion) skip_check(0)
>> > > check_limit(1000) sort(1) filter_group_entry(1) 1
>> > > site(*/LIBRARY_BRANCH/*) depth(2)
>> > > WITH w AS (
>> > >   WITH */STRING/*_keyword_xq AS (SELECT
>> > >     (to_tsquery('english_nostop', COALESCE(NULLIF( '(' ||
>> > >       btrim(regexp_replace(split_date_range(search_normalize(replace(replace(uppercase(translate_isbn1013(E'1')),
>> > >       */LONG_STRING/*))),E'(?:\\s+|:)','&','g'),'&|') || ')', '()'), '')) ||
>> > >     to_tsquery('simple', COALESCE(NULLIF( '(' ||
>> > >       btrim(regexp_replace(split_date_range(search_normalize(replace(replace(uppercase(translate_isbn1013(E'1')),
>> > >       */LONG_STRING/*))),E'(?:\\s+|:)','&','g'),'&|') || ')', '()'), ''))) AS tsq,
>> > >     (to_tsquery('english_nostop', COALESCE(NULLIF( '(' ||
>> > >       btrim(regexp_replace(split_date_range(search_normalize
>> > > 00:02:17.319491 | */STRING/* |
>> > >
>> > > And the queries by DorkBot look like they could be starting the
>> > > query, since it's using the basket function in the OPAC:
>> > >
>> > > "GET /eg/opac/results?do_basket_action=Go&query=1&detail_record_view=*/LONG_STRING/*&search-submit-go=Search&no_highlight=1&modifier=metabib&select_basket_action=1&qtype=keyword&fg%3Amat_format=1&locg=112&sort=1
>> > > HTTP/1.0" 500 16796 "-" "UT-Dorkbot/1.0"
>> > >
>> > > I've anonymized the output just to be cautious. Reports are run off
>> > > the backup database server, so it cannot be an auto-generated
>> > > report, and it doesn't happen often enough for that either. At this
>> > > point I'm tempted to block the IP addresses.
>> > > What strategies are you all using to deal with crawlers, and does
>> > > anyone have an idea what is causing this?
>> > > -Jon
>> > >
>> > > _______________________________________________
>> > > Evergreen-general mailing list
>> > > Evergreen-general@list.evergreen-ils.org
>> > > http://list.evergreen-ils.org/cgi-bin/mailman/listinfo/evergreen-general
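P.S. For anyone finding this thread later: the nginx rate-limiting technique Blake refers to uses nginx's stock limit_req module. A minimal sketch, assuming the crawler traffic can be matched by URL; the zone name, rate, burst, and backend port below are illustrative values, not taken from the linked branch:

```nginx
http {
    # 10 MB shared-memory zone keyed by client IP, allowing 1 request/second
    limit_req_zone $binary_remote_addr zone=bots:10m rate=1r/s;

    server {
        # throttle only the crawler-heavy URL
        location /opac/extras/unapi {
            # permit short bursts of 5 extra requests; excess gets a 503
            limit_req zone=bots burst=5 nodelay;
            proxy_pass http://localhost:7080;   # backend port is illustrative
        }
        # Per Blake's note, do NOT apply limit_req to the Evergreen
        # gateway locations -- that traffic is mostly the staff client.
    }
}
```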
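P.P.S. The manual "kill all queries older than 2 minutes" step JonGeorg describes can be done in a single statement on the database server. A sketch, assuming direct psql access with superuser (or pg_signal_backend) rights; the 2-minute interval and the filters are assumptions to adjust per site:

```sql
-- Sketch: terminate any client query that has been running longer
-- than 2 minutes. pg_terminate_backend() kills the whole session.
SELECT pg_terminate_backend(pid)
FROM pg_stat_activity
WHERE state = 'active'
  AND now() - query_start > interval '2 minutes'
  AND pid <> pg_backend_pid();  -- never kill our own session
```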
_______________________________________________
Evergreen-general mailing list
Evergreen-general@list.evergreen-ils.org
http://list.evergreen-ils.org/cgi-bin/mailman/listinfo/evergreen-general