I tried that and I'm still getting the loopback address, even after restarting services. Any other ideas? The robots.txt file also seems to be doing nothing, which is not much of a surprise. I've reached out to the people who host our network and control everything on the other side of the firewall.
-Jon
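For the archives, here is the combined configuration as I understand Jason's steps below (the 127.0.0.1/24 range and the X-Forwarded-For header name come from his example, so adjust them for your actual proxy addresses):

```conf
# Apache side (eg_vhost.conf) - requires mod_remoteip:
#   a2enmod remoteip && apachectl graceful
RemoteIPHeader X-Forwarded-For
RemoteIPInternalProxy 127.0.0.1/24

# nginx side - inside BOTH "location /" blocks of the Evergreen vhost:
proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
proxy_set_header X-Forwarded-Proto $scheme;
```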
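Since the firewall is managed elsewhere, I may fall back on Jason's suggestion of blocking in nginx itself. A sketch of what I have in mind, assuming the bot names match what I see in our logs (user agents can be forged, so this only catches the honest ones):

```nginx
# In the http {} block: flag known-bad crawler user agents.
map $http_user_agent $bad_bot {
    default       0;
    ~*UT-Dorkbot  1;
    ~*bingbot     1;
}

# In the server {} block, before the proxy locations:
if ($bad_bot) {
    return 403;
}
```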
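In the meantime, for the manual recovery step (killing queries older than 2 minutes), this is roughly what I run on the database server - a sketch assuming PostgreSQL 10 or later; adjust the interval and double-check what you're terminating first:

```sql
-- Terminate client backends whose current query has run > 2 minutes.
SELECT pg_terminate_backend(pid)
FROM pg_stat_activity
WHERE state = 'active'
  AND query_start < now() - interval '2 minutes'
  AND backend_type = 'client backend'
  AND pid <> pg_backend_pid();
```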
On Wed, Dec 1, 2021 at 3:57 AM Jason Stephenson <ja...@sigio.com> wrote:
> JonGeorg,
>
> If you're using nginx as a proxy, that may be down to the configuration
> of Apache and nginx.
>
> First, make sure that mod_remoteip is installed and enabled for Apache 2.
>
> Then, in eg_vhost.conf, find the 3 lines that begin with
> "RemoteIPInternalProxy 127.0.0.1/24" and uncomment them.
>
> Next, see what header Apache checks for the remote IP address. In my
> example it is "RemoteIPHeader X-Forwarded-For".
>
> Next, make sure that the following two lines appear in BOTH "location /"
> blocks in the nginx configuration:
>
> proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
> proxy_set_header X-Forwarded-Proto $scheme;
>
> After reloading/restarting nginx and Apache, you should start seeing
> remote IP addresses in the Apache logs.
>
> Hope that helps!
> Jason
>
> On 12/1/21 12:53 AM, JonGeorg SageLibrary wrote:
> > Because we're behind a firewall, all the addresses display as
> > 127.0.0.1. I can talk to the people who administer the firewall
> > though about blocking IPs. Thanks
> > -Jon
> >
> > On Tue, Nov 30, 2021 at 8:20 PM Jason Stephenson via Evergreen-general
> > <evergreen-general@list.evergreen-ils.org> wrote:
> >
> > JonGeorg,
> >
> > Check your Apache logs for the source IP addresses. If you can't find
> > them, I can share the correct configuration for Apache with Nginx so
> > that you will get the addresses logged.
> >
> > Once you know the IP address ranges, block them. If you have a
> > firewall, I suggest you block them there. If not, you can block them
> > in Nginx or in your load balancer configuration, if you have one and
> > it allows that.
> >
> > You may think you want your catalog to show up in search engines, but
> > bad bots will lie about who they are. All you can do with misbehaving
> > bots is to block them.
> >
> > HtH,
> > Jason
> >
> > On 11/30/21 9:34 PM, JonGeorg SageLibrary via Evergreen-general wrote:
> > > Question. We've been getting hammered by search engine bots [?], but
> > > they seem to all query our system at the same time - enough that
> > > it's crashing the app servers. We have a robots.txt file in place.
> > > I've increased the crawl delay from 3 to 10 seconds and have
> > > explicitly disallowed the specific bots, but I've seen no change
> > > from the worst offenders - Bingbot and UT-Dorkbot. We had over 4k
> > > hits from Dorkbot alone from 2pm-5pm today, and over 5k from Bingbot
> > > in the same timeframe, all a couple of hours after I made the
> > > changes to the robots file and restarted Apache services. Out of
> > > 100k entries in the vhost logs in that time frame that doesn't sound
> > > like a lot, but the rest of the traffic looks normal. This issue has
> > > been happening intermittently for a while [the last three
> > > occurrences were 11/30, 11/3, and 7/20], and the only thing that
> > > seems to work is to manually kill the services on the DB servers and
> > > restart services on the application servers.
> > >
> > > The symptom is an immediate spike in the database CPU load. I start
> > > killing all queries older than 2 minutes, but it still usually
> > > overwhelms the system, causing the app servers to stop serving
> > > requests.
> > > The stuck queries are almost always ones along the lines of:
> > >
> > > -- bib search: #CD_documentLength #CD_meanHarmonic #CD_uniqueWords
> > > from_metarecord(*/BIB_RECORD#/*) core_limit(100000)
> > > badge_orgs(1,138,151) estimation_strategy(inclusion) skip_check(0)
> > > check_limit(1000) sort(1) filter_group_entry(1) 1
> > > site(*/LIBRARY_BRANCH/*) depth(2)
> > > WITH w AS (
> > >   WITH */STRING/*_keyword_xq AS (SELECT
> > >     (to_tsquery('english_nostop', COALESCE(NULLIF( '(' ||
> > >       btrim(regexp_replace(split_date_range(search_normalize(replace(replace(uppercase(translate_isbn1013(E'1')),
> > >       */LONG_STRING/*))),E'(?:\\s+|:)','&','g'),'&|') || ')', '()'), '')) ||
> > >     to_tsquery('simple', COALESCE(NULLIF( '(' ||
> > >       btrim(regexp_replace(split_date_range(search_normalize(replace(replace(uppercase(translate_isbn1013(E'1')),
> > >       */LONG_STRING/*))),E'(?:\\s+|:)','&','g'),'&|') || ')', '()'), ''))) AS tsq,
> > >     (to_tsquery('english_nostop', COALESCE(NULLIF( '(' ||
> > >       btrim(regexp_replace(split_date_range(search_normalize ...
> > >
> > > 00:02:17.319491 | */STRING/* |
> > >
> > > And the queries by UT-Dorkbot look like they could be starting the
> > > query, since they use the basket function in the OPAC:
> > >
> > > "GET /eg/opac/results?do_basket_action=Go&query=1&detail_record_view=*/LONG_STRING/*&search-submit-go=Search&no_highlight=1&modifier=metabib&select_basket_action=1&qtype=keyword&fg%3Amat_format=1&locg=112&sort=1 HTTP/1.0" 500 16796 "-" "UT-Dorkbot/1.0"
> > >
> > > I've anonymized the output just to be cautious. Reports are run off
> > > the backup database server, so it cannot be an auto-generated
> > > report, and it doesn't happen often enough for that either. At this
> > > point I'm tempted to block the IP addresses.
> > > What strategies are you all using to deal with crawlers, and does
> > > anyone have an idea what is causing this?
> > > -Jon
_______________________________________________
Evergreen-general mailing list
Evergreen-general@list.evergreen-ils.org
http://list.evergreen-ils.org/cgi-bin/mailman/listinfo/evergreen-general