[ https://jira.duraspace.org/browse/DS-599?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=22587#comment-22587 ]
Peter Dietz commented on DS-599:
--------------------------------

I'm glad the query works now.

The update script is very slow, especially if you have millions of docs that need to be redone. I imagine some of the pain of the reindex comes from us not being able to update Solr docs in place, i.e. we have to copy the original doc, alter the copy, add the copy, and delete the original doc.

Another idea for optimization: we have millions of docs, but probably just 100,000 bitstreams. So instead of processing each doc individually (query Solr, get the bundle name from the DB, do a hackish Solr update), we could squeeze out some performance by reducing the number of DB queries. I'm thinking we can build an in-memory object to cache the bitstream IDs and bundle names. Ask Solr for a facet of the bitstream IDs that need to be processed (facet.mincount=1, facet.limit=-1). Query the DB in a single big query for the bundle name of each of those, and store the values in the cache object. Things that have no bundle (collection logo, community logo, deleted bitstream) will need separate handling.

Anyway, with the cached object it becomes a matter of how fast we can process loops and crunch Solr, as opposed to having to round-trip to the database for every hit.

I think Solr already chunks the processing into groups of 10 hits. It could be an experiment to boost that to a higher number.

So, those are just a few thoughts...
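
To make the caching idea a bit more concrete, here is a rough sketch of what the in-memory object could look like. This is only an illustration and not code from any of the attached patches: the class name BundleNameCache, its methods, and the bulkLookup function (standing in for the "single big query" against the DB) are all made up for the example. The bitstream IDs passed in are assumed to be what came back from the Solr facet.

```java
import java.util.ArrayList;
import java.util.Collection;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Optional;
import java.util.function.Function;

/** Hypothetical in-memory cache of bitstream ID -> bundle name. */
class BundleNameCache {
    private final Map<Integer, String> bundleByBitstream;
    private final List<Integer> withoutBundle;

    private BundleNameCache(Map<Integer, String> bundleByBitstream,
                            List<Integer> withoutBundle) {
        this.bundleByBitstream = bundleByBitstream;
        this.withoutBundle = withoutBundle;
    }

    /**
     * Builds the cache with one bulk lookup instead of one DB query per hit.
     * bulkLookup is an assumption: it stands in for a single DB query that
     * returns the bundle name for every bitstream that has one.
     */
    static BundleNameCache build(
            Collection<Integer> bitstreamIds,
            Function<Collection<Integer>, Map<Integer, String>> bulkLookup) {
        Map<Integer, String> found = new HashMap<>(bulkLookup.apply(bitstreamIds));
        // IDs the lookup did not return (logos, deleted bitstreams) are
        // collected so they can be handled separately.
        List<Integer> missing = new ArrayList<>();
        for (Integer id : bitstreamIds) {
            if (!found.containsKey(id)) {
                missing.add(id);
            }
        }
        return new BundleNameCache(found, missing);
    }

    /** Returns the cached bundle name, or empty if the bitstream has none. */
    Optional<String> bundleNameFor(int bitstreamId) {
        return Optional.ofNullable(bundleByBitstream.get(bitstreamId));
    }

    /** Bitstream IDs that need separate handling. */
    List<Integer> idsWithoutBundle() {
        return Collections.unmodifiableList(withoutBundle);
    }

    public static void main(String[] args) {
        // Fake data in place of the Solr facet and the DB query.
        List<Integer> idsFromFacet = List.of(1, 2, 3);
        BundleNameCache cache = BundleNameCache.build(
                idsFromFacet, ids -> Map.of(1, "ORIGINAL", 2, "LICENSE"));
        System.out.println(cache.bundleNameFor(1));
        System.out.println(cache.idsWithoutBundle());
    }
}
```

With something like this, the per-doc loop only does a map lookup, and the DB is hit once up front. Whether roughly 100,000 entries fit comfortably in memory would still need checking against a real installation.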

> SOLR statistics file download displays all files and not only those in the Bundle Original
> ------------------------------------------------------------------------------------------
>
>                 Key: DS-599
>                 URL: https://jira.duraspace.org/browse/DS-599
>             Project: DSpace
>          Issue Type: Bug
>          Components: Solr
>    Affects Versions: 1.6.0, 1.6.1, 1.6.2
>            Reporter: Claudia Jürgen
>            Assignee: Kevin Van de Velde
>            Priority: Major
>             Fix For: 1.8.0
>
>         Attachments: DS-559--AddBundleNameToSOLR.patch, DS-559--AddBundleNameToSOLR_V0_2.patch, DS-559--AddBundleNameToSOLR_V0_3.patch, Original_bundle_bugfix.patch
>
>
> The file download statistic for an item displays all the bitstreams regardless of the bundle they belong to.
> So licenses, extracted texts got displayed and counted. This is a bit confusing for the normal user as their existence is usually hidden from him.
> Furthermore I wonder whether views from the edit item stage should be counted as "regular" views at all.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://jira.duraspace.org/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

_______________________________________________
Dspace-devel mailing list
Dspace-devel@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/dspace-devel