Dear Alan,

Many thanks for your prompt response. Today, before writing this reply, I 
restarted my Tomcat many times because of the "Cannot get a connection, 
pool error Timeout waiting for idle object" issue. I updated my Tomcat's 
Crawler Session Manager Valve based on your response here. Unfortunately, I 
cannot apply some of the methods you used because we are running Apache 2.4 
in front of Tomcat and I don't know how to translate your Nginx 
configuration into Apache configuration. I also discovered that running 
'dspace stats-util -u' to update the spider files results in a 
java.lang.NullPointerException because the site iplists.com has been 
suspended.
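
The closest translation I have come up with so far is to tag and block bad 
bots with mod_setenvif and mod_authz_core, though I am not sure it matches 
what your Nginx configuration actually does, and the user agent pattern 
below is only a placeholder:

```apache
# Tag requests whose User-Agent matches a (placeholder) bad-bot pattern
BrowserMatchNoCase "(nastybot|scrapy|python-requests)" bad_bot

<Location "/">
    <RequireAll>
        Require all granted
        # Respond with 403 to anything tagged above
        Require not env bad_bot
    </RequireAll>
</Location>
```

This blocks outright rather than throttling, so it is blunter than your 
Nginx rate limiting, but it at least keeps the worst offenders out of 
Tomcat entirely.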

I also modified the connection parameters by increasing db.maxconnections, 
db.maxwait, and db.maxidle, as suggested by Bram in the DCAT meeting that I 
mentioned earlier. I hope this will at least stabilize our repository for 
now.
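
For reference, these are the kinds of values I ended up with in dspace.cfg 
(the numbers are just what we are trying now, not a recommendation):

```properties
# dspace.cfg database pool settings (experimental values)
db.maxconnections = 50
db.maxwait = 10000
db.maxidle = 10
```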

Many thanks again; the methods you posted here are very valuable not only 
to me but also to others who may be experiencing the same issues with these 
crawlers.

Best regards,
Euler

On Thursday, July 9, 2020 at 4:10:21 PM UTC+8, Alan Orth wrote:
>
> Dear Euler,
>
> It's a constant struggle. You absolutely have to get aggressive with 
> non-human users. I have adopted a multi-faceted effort, and I'm happy to 
> share; everything of ours is in open GitHub repositories.
>
> 1. Tagging and throttling bad bots in nginx (which sits in front of 
> Tomcat): 
> https://github.com/ilri/rmg-ansible-public/blob/master/roles/dspace/templates/nginx/default.conf.j2
> 2. Force users with "bot" user agents to use the same JSESSIONID via the 
> Tomcat Crawler Session Manager Valve: 
> https://github.com/ilri/rmg-ansible-public/blob/master/roles/dspace/templates/tomcat/server-tomcat7.xml.j2#L210
> 3. Update DSpace's built-in "spider" user agent lists so it doesn't record 
> them in Solr stats (most of those come from the COUNTER-Robots project): 
> https://github.com/ilri/DSpace/tree/5_x-prod/dspace/config/spiders/agents
> 4. Aggressive PostgreSQL connection pooling in Tomcat JDBC (requires 
> special configuration in Tomcat contexts as well): 
> https://github.com/ilri/rmg-ansible-public/blob/master/roles/dspace/templates/tomcat/server-tomcat7.xml.j2#L50
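>
> For anyone who doesn't want to dig through the templates, the pooling in 
> (4) boils down to a Tomcat JDBC Resource along these lines (the values 
> here are illustrative, not our production numbers):
>
> ```xml
> <!-- server.xml: Tomcat JDBC pool for DSpace; numbers are illustrative -->
> <Resource name="jdbc/dspace" auth="Container" type="javax.sql.DataSource"
>           factory="org.apache.tomcat.jdbc.pool.DataSourceFactory"
>           driverClassName="org.postgresql.Driver"
>           url="jdbc:postgresql://localhost:5432/dspace"
>           username="dspace" password="dspace"
>           maxActive="50" maxIdle="10" minIdle="5" maxWait="10000"
>           testOnBorrow="true" validationQuery="SELECT 1"
>           removeAbandoned="true" removeAbandonedTimeout="90"/>
> ```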
>
> This has made our site more stable, but like I said, it's still a constant 
> struggle. For a few months, starting around 2020-04, we've had an 
> increasing number of "waiting for lock" connections in both production 
> (5.x) and testing (6.x). I've tried upgrading PostgreSQL, upgrading the 
> JDBC driver, and downgrading Tomcat. Nothing works except restarting 
> Tomcat.
>
> Would love to restart the discussion on all of this... By the way, it 
> helps if your systems have something like Munin configured to graph the 
> PostgreSQL connection status every five minutes. That's helpful when a 
> user says they couldn't log in or submit an item yesterday afternoon.
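>
> Even without Munin, a quick way to check the pool state is to ask 
> PostgreSQL directly, for example:
>
> ```sql
> -- Count connections to the dspace database, grouped by state
> SELECT state, count(*)
>   FROM pg_stat_activity
>  WHERE datname = 'dspace'
>  GROUP BY state;
> ```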
>
> Regards,
>
> On Thu, Jul 9, 2020 at 10:10 AM euler <[email protected]> 
> wrote:
>
>> Hello Alan,
>>
>> This thread is 3 years old now, but our repositories are still 
>> experiencing the issues mentioned here. We are running DSpace 6.3, by the 
>> way. I've read in one of your CGSpace notes (
>> https://alanorth.github.io/cgspace-notes/2018-11/) that when you 
>> encountered crawlers making a lot of requests from different IP addresses, 
>> you added them to your Tomcat Crawler Session Manager Valve. I highly 
>> suspect that our repository 'hanging' is also caused by this volume of 
>> requests from these crawlers (mostly ones with the user agents 
>> facebookexternalhit, Turnitin, and Unpaywall).
>>
>> With regards to this, I hope you don't mind sharing the settings of your 
>> Tomcat Crawler Session Manager Valve here. Note that I have modified my 
>> postgresql.conf based on the discussion mentioned here: 
>> https://wiki.lyrasis.org/display/cmtygp/DCAT+Meeting+April+2017.
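>>
>> For context, the relevant part of my postgresql.conf currently looks 
>> something like this (these values are only what I am experimenting with, 
>> not a recommendation):
>>
>> ```
>> # postgresql.conf (experimental values)
>> max_connections = 200
>> shared_buffers = 1GB
>> ```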
>>
>> Thanks and hoping for your positive response,
>> euler
>>
>> On Friday, July 7, 2017 at 7:41:52 PM UTC+8, Alan Orth wrote:
>>>
>>> Hello,
>>>
>>> I've struggled with this in various forms over the seven years or so 
>>> we've been running DSpace. High load on public servers can easily exhaust 
>>> PostgreSQL connection slots. The easy answer is to increase the connection 
>>> limits, but before that it's better to understand why the system load is 
>>> increasing. Here are a few tips.
>>>
>>> The easiest thing is to enable DSpace's XML sitemaps. Search engines 
>>> like Google really hammer the repository as they crawl and click all sorts 
>>> of dynamic links in the Browse and Discovery sidebar. Instead, you register 
>>> your web property with Google Webmaster Tools and give them the path to 
>>> your sitemap so they can get to each item directly without crawling 
>>> haphazardly. Once you're sure Google is consuming your sitemap, you can 
>>> block them from the dynamic pages in robots.txt. Here's the link on the 
>>> wiki for DSpace 4:
>>>
>>> https://wiki.duraspace.org/display/DSDOC4x/Search+Engine+Optimization
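>>>
>>> Once the sitemap is registered, robots.txt ends up looking roughly like 
>>> this (the host name is a placeholder and the paths follow the XMLUI 
>>> defaults; adjust both to your own site):
>>>
>>> ```
>>> Sitemap: https://repository.example.org/sitemap
>>>
>>> User-agent: *
>>> # Block the dynamic browse/search pages that bots hammer
>>> Disallow: /discover
>>> Disallow: /search-filter
>>> Disallow: /browse
>>> ```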
>>>
>>> Second, look at your web server access logs. You might see many requests 
>>> from bots like Bing, Yandex, Google, Slurp, etc., and notice they will 
>>> all be coming from different IP addresses—sometimes from five or ten 
>>> concurrently! Another place you might see this is in the "Current Activity" 
>>> tab in the DSpace Admin UI control panel. The problem with this is that 
>>> each of these connections creates a new Tomcat session, which consumes 
>>> precious memory, CPU, and other resources. You can enable a Crawler Session 
>>> Manager Valve in your Tomcat config which will tell Tomcat to make all user 
>>> agents matching a certain pattern use a single session. There are some 
>>> notes from me in the comments here:
>>>
>>> https://wiki.duraspace.org/display/cmtygp/DCAT+Meeting+April+2017
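>>>
>>> The valve itself is a one-liner in Tomcat's server.xml, inside the Engine 
>>> or Host element; the crawlerUserAgents pattern below is just Tomcat's 
>>> default, shown to illustrate the idea:
>>>
>>> ```xml
>>> <!-- Make all matching crawler user agents share a single session -->
>>> <Valve className="org.apache.catalina.valves.CrawlerSessionManagerValve"
>>>        crawlerUserAgents=".*[bB]ot.*|.*Yahoo! Slurp.*|.*Feedfetcher-Google.*"
>>>        sessionInactiveInterval="60"/>
>>> ```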
>>>
>>> And finally, the last link contains a discussion about updating the 
>>> DSpace defaults for PostgreSQL connections from a recent developers 
>>> meeting.
>>>
>>> I hope that helps. Cheers,
>>>
>>> On Fri, Jul 7, 2017 at 12:57 AM christian criollo <[email protected]> 
>>> wrote:
>>>
>>>> Hello Alan,
>>>>
>>>> Yes, the repository is public. Thanks for your answer.
>>>>
>>>>
>>>> El jueves, 6 de julio de 2017, 2:09:59 (UTC-5), Alan Orth escribió:
>>>>
>>>>> Hello,
>>>>>
>>>>> Is your repository public? It could be that you are getting lots of 
>>>>> traffic from search bots or people harvesting via REST / OAI... this 
>>>>> would 
>>>>> definitely increase the load on the server and create more database 
>>>>> connections.
>>>>>
>>>>> Ciao,
>>>>>
>>>>> On Wed, May 24, 2017 at 11:23 PM christian criollo <[email protected]> 
>>>>> wrote:
>>>>>
>>>>>>
>>>>>> Hi everybody
>>>>>>
>>>>>> Last month our repository started presenting faults like 
>>>>>> *org.apache.commons.dbcp.SQLNestedException: Cannot get a connection, 
>>>>>> pool error Timeout waiting for idle object*. I modified the max 
>>>>>> connections variable in dspace.cfg to 100, but the system is still the 
>>>>>> same, and I watch the sessions in Tomcat increase obstinately. I don't 
>>>>>> know what's wrong. Please, if somebody can tell me what I can do to fix 
>>>>>> this error, thanks for the help.
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>> -- 
>>>>>> You received this message because you are subscribed to the Google 
>>>>>> Groups "DSpace Technical Support" group.
>>>>>> To unsubscribe from this group and stop receiving emails from it, send 
>>>>>> an email to [email protected].
>>>>>> To post to this group, send email to [email protected].
>>>>>> Visit this group at https://groups.google.com/group/dspace-tech.
>>>>>> For more options, visit https://groups.google.com/d/optout.
>>>>>>
>>>>> -- 
>>>>> Alan Orth
>>>>> [email protected]
>>>>>
>>>>> https://picturingjordan.com
>>>>> https://englishbulgaria.net
>>>>> https://mjanja.ch
>>>>>
>>>>
>>> -- 
>>> Alan Orth
>>> [email protected]
>>> https://picturingjordan.com
>>> https://englishbulgaria.net
>>> https://mjanja.ch
>>>
>> -- 
>> All messages to this mailing list should adhere to the DuraSpace Code of 
>> Conduct: https://duraspace.org/about/policies/code-of-conduct/
>> To view this discussion on the web visit 
>> https://groups.google.com/d/msgid/dspace-tech/4990e2ca-88d7-453e-8f6f-9f859494637do%40googlegroups.com.
>>
>
>
> -- 
> Alan Orth
> [email protected]
> https://picturingjordan.com
> https://englishbulgaria.net
> https://mjanja.ch
>

To view this discussion on the web visit 
https://groups.google.com/d/msgid/dspace-tech/584f648f-9027-4d7a-8409-cc59fe275f3fo%40googlegroups.com.
