[
https://jira.duraspace.org/browse/DS-1073?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=23092#comment-23092
]
Mark H. Wood commented on DS-1073:
----------------------------------
As Richard Rodgers points out, the -maximum flag is to limit the amount of work
done *filtering*, so it isn't currently useless. But the point is well made
that very large installations are going to waste a lot of cycles just grubbing
through the tables looking for work to do, and that waste increases
proportionally as the repository grows. Objects enter the system and (typically)
never leave it. This suggests to me that, in normal operation, objects ought
to be processed on entry -- that is, everyday filtering should happen upon
acceptance into a collection, with filter-media reserved for special cases such
as "we only just decided that we want thumbnails for everything in our
ten-year-old repository".
We should re-examine *all* of our periodic processes that visit every object.
These tasks might be better served by capturing events into a queue, so that
background processing touches only those objects that actually require it.
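To make the queue idea concrete, here is a minimal sketch in Python rather than
DSpace's Java, using only the standard library. Every name in it (accept_item,
filter_item, the events queue) is hypothetical and is not DSpace API; it only
illustrates "capture an event at acceptance, process just those objects in the
background".

```python
import queue
import threading

def filter_item(item_id, processed):
    """Hypothetical stand-in for running the media filters on one item."""
    processed.append(item_id)

def worker(events, processed):
    """Drain the queue; only items that raised an event are ever touched."""
    while True:
        item_id = events.get()
        if item_id is None:  # sentinel value: no more work, stop the worker
            events.task_done()
            break
        filter_item(item_id, processed)
        events.task_done()

def accept_item(item_id, events):
    """Hypothetical hook called when an item is accepted into a collection."""
    events.put(item_id)

events = queue.Queue()
processed = []
thread = threading.Thread(target=worker, args=(events, processed))
thread.start()

# Three items enter the repository; each one queues exactly one event.
for item_id in (101, 102, 103):
    accept_item(item_id, events)

events.put(None)  # tell the worker to finish
thread.join()
```

The cost of the background work here depends on the number of newly accepted
items, not on the size of the repository. A real implementation would need a
durable queue so that pending events survive a restart.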
Grovelling over the whole Item table in forward *or* reverse order by
last_modified still has us examining the entire table, most of which requires
no action, *and* adds the ordering overhead. We'd also need to change the
schema: there is no index on last_modified, so ORDER BY would be sorting those
million rows every night!
If we want to continue filtering periodically rather than in response to events,
then "modified in the last N intervals" would help a good deal. "SELECT
item_id FROM item WHERE in_archive AND last_modified > $start_stamp ORDER BY
last_modified DESC" should be fairly cheap *provided* we index last_modified:
walk the last_modified_idx until the second WHERE condition is unsatisfied, if
the query optimizer is that clever. But we're still throwing away information
and synthesizing it again later, because we *know* at item entry that it will
need postprocessing, if the site wants postprocessing.
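The periodic variant can be tried out with a small sketch, here in Python
against an in-memory SQLite database. The table is a cut-down assumption, not
the real DSpace schema: timestamps are plain integers, and the LIMIT clause is
my stand-in for the -m maximum. The index name last_modified_idx is the one
used above.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Simplified, assumed shape of the item table (not the real DSpace schema).
conn.execute(
    "CREATE TABLE item ("
    "item_id INTEGER PRIMARY KEY, in_archive BOOLEAN, last_modified INTEGER)"
)
# The index the query depends on; without it ORDER BY must sort every row.
conn.execute("CREATE INDEX last_modified_idx ON item (last_modified)")

conn.executemany(
    "INSERT INTO item VALUES (?, ?, ?)",
    [
        (1, 1, 100),  # archived, older than the cutoff
        (2, 1, 200),
        (3, 0, 300),  # recent but not in the archive
        (4, 1, 400),
        (5, 1, 500),
    ],
)

def recently_modified(conn, start_stamp, maximum):
    """Return up to `maximum` archived item ids modified after start_stamp,
    newest first."""
    cursor = conn.execute(
        "SELECT item_id FROM item "
        "WHERE in_archive AND last_modified > ? "
        "ORDER BY last_modified DESC LIMIT ?",
        (start_stamp, maximum),
    )
    return [row[0] for row in cursor]

batch = recently_modified(conn, start_stamp=150, maximum=2)
print(batch)
```

Whether a given optimizer really walks the index and stops early has to be
checked with that database's own EXPLAIN output; this sketch only shows the
shape of the query and the index it relies on.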
> The maximum flag on filter-media is useless if results are returned in the
> same order every time
> ------------------------------------------------------------------------------------------------
>
> Key: DS-1073
> URL: https://jira.duraspace.org/browse/DS-1073
> Project: DSpace
> Issue Type: Bug
> Components: DSpace API
> Affects Versions: 1.8.0
> Reporter: Samuel Ottenhoff
> Attachments: DS-1073.patch
>
>
> Scenario: institution has a million PDFs on one server and needs to run
> filter-media every night. Institution only wants to run on 10k PDFs per night.
> There is a "-m" flag to set a maximum. But the results are returned in the
> same order every time, preventing new items from being picked up.
> Possible solutions:
> 1) Return items sorted by recently updated?
> 2) Return a random sort of elements instead of the same ones every time?