Hi Alan,

Thanks for the note and for sharing your solution.

Another possible solution would be to skip caching entirely for the search
pages.

It seems like whatever time this cache saves in the Cocoon processing
pipeline is far outweighed by the cost of all the database queries needed
to look up bundles and bitstreams. As far as I can tell, discovery pages
don't really need any bitstream information besides what is in the Solr
index. And it seems like these bundle/bitstream queries are performed even
when there is a valid cache (it is my understanding that the caching is for
later Cocoon processing steps).

I'm not a Cocoon expert, though, and I haven't read through all the 
DSpaceValidity code, so I might be wrong.
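
To make the cost difference concrete, here is a rough, self-contained Java
sketch of the ORIGINAL-only keying idea discussed in this thread. The `Item`
and `Bundle` classes and the `keysFor` method are hypothetical, simplified
stand-ins for illustration only; they are not the real `org.dspace.content`
classes, and this is not the actual `DSpaceValidity` code or the code in the
PR.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical model, NOT the real DSpace API: it only illustrates how many
// objects end up in a cache validity key under each strategy.
class ValidityKeySketch {

    // Stand-in for a DSpace bundle (e.g. ORIGINAL, LICENSE, or a custom archival bundle).
    static final class Bundle {
        final String name;
        final List<String> bitstreamIds;

        Bundle(String name, String... bitstreamIds) {
            this.name = name;
            this.bitstreamIds = Arrays.asList(bitstreamIds);
        }
    }

    // Stand-in for a DSpace item that owns several bundles.
    static final class Item {
        final String handle;
        final List<Bundle> bundles;

        Item(String handle, Bundle... bundles) {
            this.handle = handle;
            this.bundles = Arrays.asList(bundles);
        }
    }

    // Collects the identifiers that would be keyed for one search result.
    // In the real system each keyed bitstream implies extra database lookups.
    static List<String> keysFor(Item item, boolean originalOnly) {
        List<String> keys = new ArrayList<>();
        keys.add(item.handle);
        for (Bundle bundle : item.bundles) {
            if (originalOnly && !"ORIGINAL".equals(bundle.name)) {
                continue; // skip "dark" (non-ORIGINAL) bundles entirely
            }
            keys.add(item.handle + "/" + bundle.name);
            keys.addAll(bundle.bitstreamIds);
        }
        return keys;
    }

    public static void main(String[] args) {
        Item item = new Item("123/1",
                new Bundle("ORIGINAL", "b1"),
                new Bundle("ARCHIVE", "b2", "b3", "b4"));
        System.out.println("all bundles:   " + keysFor(item, false).size());
        System.out.println("ORIGINAL only: " + keysFor(item, true).size());
    }
}
```

With dozens or hundreds of non-ORIGINAL bitstreams per item, the gap between
the two counts grows with every search result on the page.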

Jacob

On Monday, November 5, 2018 at 1:49:00 PM UTC-6, Alan Orth wrote:
>
> Good work, Jacob.
>
> I think I'll test this on our DSpace 5.8 site as well. Our solution is 
> different: we just severely rate limit and dissuade bots from accessing 
> dynamic pages like discover, browse, search-filter, and most-popular 
> (specifically, the ones in communities and collections, because the 
> site-wide robots.txt can't use wildcards). See our nginx configuration 
> template:
>
>
> https://github.com/ilri/rmg-ansible-public/commit/1aadbb839659bcb2326fbc9bb0b2b67bf13ed7f0
>
> Cheers,
>
> On Thu, Nov 1, 2018 at 11:24 PM <kar...@gmail.com> wrote:
>
>> PR is at https://github.com/DSpace/DSpace/pull/2254.
>>
>> Jacob
>>
>> On Thursday, November 1, 2018 at 4:00:45 PM UTC-5, kar...@gmail.com 
>> wrote:
>>>
>>> Hi Tim,
>>>
>>> I wasn't sure if my assumption that only bitstreams in the ORIGINAL 
>>> bundle are relevant to search results cache invalidation would be valid for 
>>> all users of DSpace. 
>>>
>>> I'll go ahead and open a PR though.
>>>
>>> Jacob
>>>
>>>
>>>
>>> On Thursday, November 1, 2018 at 3:47:13 PM UTC-5, Tim Donohue wrote:
>>>>
>>>> Hi Jacob,
>>>>
>>>> Would you be willing to submit a GitHub Pull Request with the code 
>>>> changes you've made?  Or, create a ticket in our ticketing system (
>>>> https://jira.duraspace.org/browse/DS) to describe the problem and 
>>>> attach the fix? (You can request a JIRA account by just emailing 
>>>> sysa...@duraspace.org.)
>>>>
>>>> Most of the development / bug fixes and improvements to DSpace come 
>>>> from community members like yourself (and situations just like this -- 
>>>> where someone figures out a fix that is generally applicable to others).  
>>>> More on our code contribution process can be found at: 
>>>> https://wiki.duraspace.org/display/DSPACE/Code+Contribution+Guidelines 
>>>>
>>>> - Tim
>>>>
>>>> On Thu, Nov 1, 2018 at 3:43 PM <kar...@gmail.com> wrote:
>>>>
>>>>> I've figured this out!
>>>>>
>>>>> `org.dspace.app.xmlui.utils.DSpaceValidity`, which is used in
>>>>> `AbstractSearch` to cache results, actually looks up and keys all
>>>>> bundles, then all bitstreams, for each item in the search results.
>>>>>
>>>>> It seems reasonable to assume (at least for our use case) that only 
>>>>> bitstreams in the ORIGINAL bundle are relevant to search results (i.e., a 
>>>>> change in a public file is a reason to invalidate the cache, but a change 
>>>>> in non-ORIGINAL files is not).
>>>>>
>>>>> I've added a method to `DSpaceValidity` called
>>>>> `addIfItemOnlyAddOriginalBundles`, which only keys ORIGINAL bundles for
>>>>> an `Item`, and defers to the existing `add` for everything else. I then
>>>>> updated `AbstractSearch` to call my `addIfItemOnlyAddOriginalBundles`
>>>>> when it is adding the search result DSOs to the validity object.
>>>>>
>>>>> This has dropped my SQL query total from over 9000 to about 60, and 
>>>>> the page loads relatively fast.
>>>>>
>>>>> Unfortunately, this won't help those who have lots of bitstreams in 
>>>>> their ORIGINAL bundle, but that is perhaps unavoidable.
>>>>>
>>>>> Jacob
>>>>>
>>>>>
>>>>>
>>>>> On Wednesday, October 31, 2018 at 11:52:25 AM UTC-5, kar...@gmail.com 
>>>>> wrote:
>>>>>
>>>>>> Hi all,
>>>>>>
>>>>>> We are running DSpace 5.9 XMLUI with Tomcat 7 and Java 8 on a RHEL 7
>>>>>> server, with a small-ish collection of items (about 20,000). We are
>>>>>> running production with Oracle 12, but I have replicated the same issue
>>>>>> with PostgreSQL 9.2.
>>>>>>
>>>>>> We have recently noticed some very long page load times. Any given
>>>>>> discover/search page can take 2-7 seconds to load, and when there is
>>>>>> even a moderate amount of traffic (e.g., when a bot is indexing the site
>>>>>> at about 10 requests per second), page load times can take 30-60 seconds
>>>>>> or longer.
>>>>>>
>>>>>> We have made the changes suggested at 
>>>>>> https://wiki.duraspace.org/display/DSDOC5x/Performance+Tuning+DSpace 
>>>>>> for both Tomcat and PostgreSQL.
>>>>>>
>>>>>> Our production site has been customized extensively, but I was able
>>>>>> to replicate the issue with an untouched DSpace 5.9 build using the
>>>>>> default Mirage theme with XMLUI.
>>>>>>
>>>>>> The issue is the same with both Oracle and PostgreSQL (PostgreSQL 
>>>>>> seems a little bit better). 
>>>>>>
>>>>>> I have tried changing from Java 8 to Java 7.
>>>>>>
>>>>>> I have bumped up the database connection pool size to 300.
>>>>>>
>>>>>> Digging through the logs is difficult, since the problem only really 
>>>>>> emerges under (moderate) load.
>>>>>>
>>>>>> However, I was able to track a single page request (to /discover),
>>>>>> and noticed that there were over 9000 individual SQL queries (for a
>>>>>> single page load) that looked like:
>>>>>>
>>>>>> DEBUG org.dspace.storage.rdbms.DatabaseManager @ Running query "SELECT * FROM MetadataValue WHERE resource_id= ? and resource_type_id = ? ORDER BY metadata_field_id, place"  with parameters: 144458,0
>>>>>>
>>>>>> (The resource_type_id `0` is for bitstreams.)
>>>>>>
>>>>>> I *think* (but could be wrong) that this is the source of our
>>>>>> performance problem; that the database is just getting bogged down with
>>>>>> so many requests. Looking at PostgreSQL's slow query logging, some of
>>>>>> these individual queries are taking about 1 second.
>>>>>>
>>>>>> Our situation is perhaps unique in that we have dozens (sometimes
>>>>>> hundreds) of "dark" (non-ORIGINAL) archival files associated with an
>>>>>> item, and it looks like this discover page is trying to load metadata
>>>>>> for all of them.
>>>>>>
>>>>>> This doesn't happen with an equivalent query in JSPUI.
>>>>>>
>>>>>> Any suggestions or workarounds? Why does the search page need to get 
>>>>>> metadata for all bitstreams? 
>>>>>>
>>>>>> Does anyone know if upgrading to DSpace 6 would resolve this issue?
>>>>>>
>>>>>> Thanks,
>>>>>>
>>>>>> Jacob
>>>>>>
>>>>>>
>>>>>> -- 
>>>>> All messages to this mailing list should adhere to the DuraSpace Code 
>>>>> of Conduct: https://duraspace.org/about/policies/code-of-conduct/
>>>>> --- 
>>>>> You received this message because you are subscribed to the Google 
>>>>> Groups "DSpace Technical Support" group.
>>>>> To unsubscribe from this group and stop receiving emails from it, send 
>>>>> an email to dspace-tech...@googlegroups.com.
>>>>> To post to this group, send email to dspac...@googlegroups.com.
>>>>> Visit this group at https://groups.google.com/group/dspace-tech.
>>>>> For more options, visit https://groups.google.com/d/optout.
>>>>>
>>>> -- 
>>>> Tim Donohue
>>>> Technical Lead for DSpace & DSpaceDirect
>>>> DuraSpace.org | DSpace.org | DSpaceDirect.org
>>>>
>
>
> -- 
> Alan Orth
> alan...@gmail.com
> https://picturingjordan.com
> https://englishbulgaria.net
> https://mjanja.ch
> "In heaven all the interesting people are missing." ―Friedrich Nietzsche
>

