[ 
https://jira.duraspace.org/browse/DS-599?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=22587#comment-22587
 ] 

Peter Dietz commented on DS-599:
--------------------------------

I'm glad the query works now. 

The update script is very slow, especially if you have millions of docs that 
need to be re-done. I imagine some of the pain of the reindex comes from us not 
being able to update Solr docs in place, i.e. we have to copy the original doc, 
alter the copy, add the copy, and delete the original doc.
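
A toy sketch of that copy/alter/add/delete sequence, using a plain Python list 
as a stand-in for the Solr index. The function name and the "bundleName" field 
are assumptions for illustration; none of this is a real Solr client call.

```python
import copy

def reindex_doc(index, position, bundle_name):
    """Replace the doc at `position` with an altered copy (no in-place update)."""
    original = index[position]
    updated = copy.deepcopy(original)     # 1. copy the original doc
    updated["bundleName"] = bundle_name   # 2. alter the copy (assumed field name)
    index.append(updated)                 # 3. add the copy
    del index[position]                   # 4. delete the original doc
    return updated

# Two fake statistics docs; only the first one gets re-done here.
index = [{"type": 0, "id": 1}, {"type": 0, "id": 2}]
reindex_doc(index, 0, "ORIGINAL")
```

Each doc costs an add plus a delete, which is part of why millions of docs hurt.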

Another idea for optimization: we have millions of docs, but probably just 
100,000 bitstreams. So instead of processing each doc individually (query 
Solr, get the bundle name from the DB, do the hackish Solr update), we could 
squeeze out some performance by reducing the number of DB queries. I'm thinking 
we can build an in-memory object to cache the bitstream IDs and bundle names. 
Ask Solr for a facet of the bitstream IDs that need to be processed 
(facet.mincount=1, facet.limit=-1). Query the DB in a single big query for the 
bundle name of each of those, and store the values in the cache object. Things 
that have no bundle (collection logo, community logo, deleted bitstream) will 
need separate handling. Anyway, with the cached object, it becomes a matter of 
how fast we can process loops and crunch Solr, as opposed to having to 
round-trip to the database for every hit.
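
A minimal sketch of the cache idea, assuming a hypothetical one-table mapping 
`bitstream_bundle(bitstream_id, bundle_name)` in an in-memory SQLite database; 
the real DSpace schema differs and would need a join. The `q` and `facet.field` 
values in the facet request are also assumptions; only facet.mincount and 
facet.limit come from the text above.

```python
import sqlite3
from urllib.parse import urlencode

# Query string for the Solr facet request. "type:0" and the "id" field are
# assumed stand-ins for "bitstream docs that still need processing".
facet_params = urlencode({
    "q": "type:0",
    "rows": 0,
    "facet": "true",
    "facet.field": "id",
    "facet.mincount": 1,
    "facet.limit": -1,
})

def build_bundle_cache(conn, bitstream_ids):
    """Map each bitstream ID to its bundle name with one DB query.

    IDs with no bundle (logos, deleted bitstreams) map to None so the
    caller can give them separate handling.
    """
    wanted = set(bitstream_ids)
    cache = {bid: None for bid in wanted}
    # One pass over the (hypothetical) mapping table instead of one query per hit.
    for bid, name in conn.execute(
            "SELECT bitstream_id, bundle_name FROM bitstream_bundle"):
        if bid in wanted:
            cache[bid] = name
    return cache

# Demo data: three bitstreams known to the DB, one facet ID (99) without a bundle.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE bitstream_bundle (bitstream_id INTEGER, bundle_name TEXT)")
conn.executemany(
    "INSERT INTO bitstream_bundle VALUES (?, ?)",
    [(1, "ORIGINAL"), (2, "LICENSE"), (3, "TEXT")])
cache = build_bundle_cache(conn, [1, 2, 99])
no_bundle = sorted(bid for bid, name in cache.items() if name is None)
```

After this, the per-doc loop only does dictionary lookups; the entries in 
`no_bundle` are the ones that need the separate handling.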

I think Solr already chunks the processing into groups of 10 hits. It could be 
worth an experiment to boost that to a higher number.
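
To make the batch-size experiment concrete, here is a small sketch of chunked 
processing over an in-memory list of hits. The helper name and the sizes are 
illustrative assumptions; a real run would page through Solr results instead.

```python
def in_batches(hits, batch_size):
    """Yield consecutive slices of `hits`, each at most `batch_size` long."""
    for start in range(0, len(hits), batch_size):
        yield hits[start:start + batch_size]

hits = list(range(25))                   # stand-in for 25 Solr hits
small = list(in_batches(hits, 10))       # groups of 10: three round trips
large = list(in_batches(hits, 1000))     # bigger group: one round trip
```

Fewer, larger batches mean fewer round trips, at the cost of more memory per batch.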

So, that's just a few thoughts.
                
> SOLR statistics file download displays all files and not only those in the 
> Bundle Original
> ------------------------------------------------------------------------------------------
>
>                 Key: DS-599
>                 URL: https://jira.duraspace.org/browse/DS-599
>             Project: DSpace
>          Issue Type: Bug
>          Components: Solr
>    Affects Versions: 1.6.0, 1.6.1, 1.6.2
>            Reporter: Claudia Jürgen
>            Assignee: Kevin Van de Velde
>            Priority: Major
>             Fix For: 1.8.0
>
>         Attachments: DS-559--AddBundleNameToSOLR.patch, 
> DS-559--AddBundleNameToSOLR_V0_2.patch, 
> DS-559--AddBundleNameToSOLR_V0_3.patch, Original_bundle_bugfix.patch
>
>
> The file download statistic for an item displays all the bitstreams 
> regardless of the bundle they belong to.
> So licenses, extracted texts got displayed and counted.  This is a bit 
> confusing for the normal user as their existence is usually hidden from him.
> Furthermore I wonder whether views from the edit item stage should be counted 
> as "regular" views at all.


       
