[ 
https://jira.duraspace.org/browse/DS-1073?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=23092#comment-23092
 ] 

Mark H. Wood commented on DS-1073:
----------------------------------

As Richard Rodgers points out, the -maximum flag is to limit the amount of work 
done *filtering*, so it isn't currently useless.  But the point is well made 
that very large installations are going to waste a lot of cycles just grubbing 
through the tables looking for work to do, and that waste increases 
proportionally as the repository grows.  Objects enter the system and (typically) 
never leave it.  This suggests to me that, in normal operation, objects ought 
to be processed on entry -- that is, everyday filtering should happen upon 
acceptance into a collection, with filter-media reserved for special cases such 
as "we only just decided that we want thumbnails for everything in our 
ten-year-old repository".

We should re-examine *all* of our periodic processes that visit every object.  
These tasks might be better served by capturing events into a queue for 
background processing of only objects which require such processing.
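
To make the queue idea concrete, here is a minimal sketch in Python.  Everything 
in it is hypothetical: FilterQueue, on_item_accepted and run_batch are 
illustrative names, not part of the DSpace API, and a real implementation would 
presumably persist the queue rather than hold it in memory.

```python
from collections import deque


class FilterQueue:
    """Hypothetical pending-work queue; not part of the DSpace API."""

    def __init__(self):
        self._pending = deque()   # item ids awaiting filtering, oldest first
        self._queued = set()      # ids currently in the queue, to avoid duplicates

    def on_item_accepted(self, item_id):
        # Called when an item is accepted into a collection (the "event").
        if item_id not in self._queued:
            self._queued.add(item_id)
            self._pending.append(item_id)

    def run_batch(self, maximum, filter_item):
        # Filter at most `maximum` queued items; the rest wait for the next run.
        processed = []
        while self._pending and len(processed) < maximum:
            item_id = self._pending.popleft()
            self._queued.discard(item_id)
            filter_item(item_id)
            processed.append(item_id)
        return processed


queue = FilterQueue()
for item_id in (101, 102, 103):
    queue.on_item_accepted(item_id)

filtered = []
first_run = queue.run_batch(2, filtered.append)    # capped, like -m 2
second_run = queue.run_batch(2, filtered.append)   # picks up the remainder
```

The background job only ever touches items that were queued, so its cost tracks 
the amount of new work rather than the size of the repository, and a maximum per 
run still makes progress because processed items leave the queue.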

Grovelling over the whole Item table in forward *or* reverse order by 
last_modified still has us examining the entire table, most of which requires 
no action, *and* adds the ordering overhead.  We'd also need to change the 
schema: there is no index on last_modified, so ORDER BY would be sorting those 
million rows every night!

If we want to keep filtering periodically rather than in response to events, 
then "modified in the last N intervals" would help a good deal.  "SELECT 
item_id FROM item WHERE in_archive AND last_modified > $start_stamp ORDER BY 
last_modified DESC" should be fairly cheap *provided* we index last_modified:  
walk the last_modified_idx until the second WHERE condition is unsatisfied, if 
the query optimizer is that clever.  But we're still throwing away information 
and synthesizing it again later, because we *know* at item entry that it will 
need postprocessing, if the site wants postprocessing.
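
A minimal sketch of the windowed query, using Python's sqlite3 module as a 
stand-in for the real database.  The table here is reduced to the three columns 
named above, the timestamps are stored as ISO-8601 text, and the LIMIT clause 
(to mirror the -m maximum) is my assumption; none of this is the actual DSpace 
schema or code.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE item ("
    " item_id INTEGER PRIMARY KEY,"
    " in_archive INTEGER NOT NULL,"
    " last_modified TEXT NOT NULL)"
)
# The index that lets the database walk recent rows instead of sorting them all.
conn.execute("CREATE INDEX last_modified_idx ON item (last_modified)")
conn.executemany(
    "INSERT INTO item VALUES (?, ?, ?)",
    [
        (1, 1, "2012-01-01T00:00:00"),  # archived, but older than the window
        (2, 1, "2012-02-10T08:30:00"),  # archived and recent
        (3, 0, "2012-02-11T09:00:00"),  # recent, but not in the archive
        (4, 1, "2012-02-12T23:15:00"),  # archived and most recent
    ],
)


def recently_modified(conn, start_stamp, maximum):
    # Archived items modified after start_stamp, newest first, capped at maximum.
    rows = conn.execute(
        "SELECT item_id FROM item"
        " WHERE in_archive AND last_modified > ?"
        " ORDER BY last_modified DESC LIMIT ?",
        (start_stamp, maximum),
    )
    return [row[0] for row in rows]


recent = recently_modified(conn, "2012-02-01T00:00:00", 10)
```

Whether a given database actually uses last_modified_idx for both the range 
condition and the ordering depends on its optimizer, so it would be worth 
checking the query plan on the production DBMS before relying on this.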
                
> The maximum flag on filter-media is useless if results are returned in the 
> same order every time
> ------------------------------------------------------------------------------------------------
>
>                 Key: DS-1073
>                 URL: https://jira.duraspace.org/browse/DS-1073
>             Project: DSpace
>          Issue Type: Bug
>          Components: DSpace API
>    Affects Versions: 1.8.0
>            Reporter: Samuel Ottenhoff
>         Attachments: DS-1073.patch
>
>
> Scenario: institution has a million PDFs on one server and needs to run 
> filter-media every night. Institution only wants to run on 10k PDFs per night.
> There is a "-m" flag to set a maximum. But the results are returned in the 
> same order every time, preventing new items from being picked up.
> Possible solutions:
>  1) Return items sorted by recently updated?
>  2) Return a random sort of elements instead of the same ones every time?


        

_______________________________________________
Dspace-devel mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/dspace-devel
