[ 
https://jira.duraspace.org/browse/DS-1073?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=23075#comment-23075
 ] 

Samuel Ottenhoff commented on DS-1073:
--------------------------------------

I think the issue is that if you have one million items, the check to see 
whether each derivative bitstream already exists is extremely slow, so running 
the process on a nightly basis becomes impractical.  

When I run filter-media on a large site, it takes close to 12 hours, most of 
it spent emitting output like this:


SKIPPED: bitstream 36232 (item: 123456789/11088) because 'archive0123.jpg.jpg' 
already exists.

So the use case here is a large repository with, say, 100 new items per week.  
There should be a fast way to run filter-media that doesn't need to loop 
through all one million records every night just to thumbnail the new ones.
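
As a rough illustration of the "fast way" I have in mind, here is a minimal 
sketch in plain Python. It is not DSpace code: the Item class, the 
last_modified field, the last_run timestamp and the handles are all 
assumptions made up for the example. The idea is only that a run should look 
at items changed since the previous run instead of re-checking everything.

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass
class Item:
    handle: str               # hypothetical identifier, for illustration only
    last_modified: datetime   # assumed per-item modification timestamp


def items_needing_filter(items, last_run):
    """Return only the items modified after the previous run.

    A real implementation would push this predicate into the database
    query rather than scan a list; the list just keeps the sketch
    self-contained.
    """
    return [item for item in items if item.last_modified > last_run]


if __name__ == "__main__":
    last_run = datetime(2012, 1, 1)
    items = [
        Item("example/1", datetime(2011, 6, 1)),  # old, already processed
        Item("example/2", datetime(2012, 1, 3)),  # added since the last run
    ]
    for item in items_needing_filter(items, last_run):
        print(item.handle)
```

With a selection like this, the nightly cost would scale with the number of 
new or changed items rather than with the size of the whole repository.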

                
> The maximum flag on filter-media is useless if results are returned in the 
> same order every time
> ------------------------------------------------------------------------------------------------
>
>                 Key: DS-1073
>                 URL: https://jira.duraspace.org/browse/DS-1073
>             Project: DSpace
>          Issue Type: Bug
>          Components: DSpace API
>    Affects Versions: 1.8.0
>            Reporter: Samuel Ottenhoff
>         Attachments: DS-1073.patch
>
>
> Scenario: an institution has a million PDFs on one server and needs to run 
> filter-media every night. The institution only wants to process 10k PDFs per 
> night. There is a "-m" flag to set a maximum, but the results are returned in 
> the same order every time, which prevents new items from being picked up.
> Possible solutions:
>  1) Return items sorted by most recently updated?
>  2) Return elements in random order instead of the same ones every time?
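
The two proposed orderings can be compared with a small model. This is a 
sketch in plain Python, not the filter-media implementation: pick_batch, its 
strategy names and the (item_id, last_modified) tuples are hypothetical, and 
"fixed" simply stands in for whatever stable order the results come back in 
today.

```python
import random


def pick_batch(items, maximum, strategy="fixed", rng=None):
    """Pick up to `maximum` item ids from `items`.

    items: list of (item_id, last_modified) tuples in storage order.
    strategy: "fixed" (storage order), "recent" (newest first) or "random".
    rng: optional random.Random instance, used only by "random".
    """
    if strategy == "fixed":
        ordered = list(items)
    elif strategy == "recent":
        ordered = sorted(items, key=lambda it: it[1], reverse=True)
    elif strategy == "random":
        ordered = list(items)
        (rng or random).shuffle(ordered)
    else:
        raise ValueError(f"unknown strategy: {strategy}")
    return [item_id for item_id, _ in ordered[:maximum]]


if __name__ == "__main__":
    # Ten items; a higher last_modified value means a newer item.
    items = [(i, i) for i in range(1, 11)]
    print(pick_batch(items, 3))             # same first three on every run
    print(pick_batch(items, 3, "recent"))   # the three newest items
    print(pick_batch(items, 3, "random"))   # a different sample each run
```

In this model the "fixed" strategy never reaches anything past the first 
`maximum` entries, which is the problem described above. "recent" reaches new 
items first, while "random" gets to every item eventually but can still spend 
part of each night's budget on items that were already processed.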

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://jira.duraspace.org/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

_______________________________________________
Dspace-devel mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/dspace-devel
