[Dspace-devel] [DSJ] Commented: (DS-470) Batch import times increase drastically as repository size increases; patch to mitigate the problem

Graham Triggs (JIRA) Tue, 26 Jan 2010 16:01:27 -0800

    [ 
http://jira.dspace.org/jira/browse/DS-470?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=11105#action_11105
 ]


Graham Triggs commented on DS-470:
----------------------------------

I'm looking at the possibility of having the indexer determine whether it needs 
to call pruneIndexes() on a per-item basis (in theory, it shouldn't need to if 
it is only adding new data). Alternatively, indexing could be treated as a 
transactional process, i.e. explicit start() / index() / index() / commit() 
operations.

pruneIndexes() is very much an internal implementation detail of how the 
indexer updates the browse tables, and as such it really should not be exposed 
as part of the public API.
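
To make the transactional idea concrete, here is a minimal sketch. All names in
it (TransactionalIndexer, writeBrowseRows, the counters) are hypothetical and
are not part of the DSpace API; the method bodies are stand-ins for the real
browse-table work. It only illustrates the shape: callers see start() / index()
/ commit(), while pruning stays private and runs once per commit.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch only: none of these names exist in DSpace.
final class TransactionalIndexer {
    private final List<String> pending = new ArrayList<String>();
    private boolean open = false;
    private int indexedCount = 0; // stand-in for rows written to browse tables
    private int pruneCount = 0;   // number of prune passes actually run

    // Begin a batch; items are queued until commit().
    void start() {
        if (open) {
            throw new IllegalStateException("transaction already open");
        }
        open = true;
    }

    // Queue one item for indexing. No pruning happens here.
    void index(String itemId) {
        if (!open) {
            throw new IllegalStateException("call start() first");
        }
        pending.add(itemId);
    }

    // Index every queued item, then prune exactly once for the whole batch.
    void commit() {
        if (!open) {
            throw new IllegalStateException("call start() first");
        }
        for (String itemId : pending) {
            writeBrowseRows(itemId);
        }
        pruneIndexes();
        pending.clear();
        open = false;
    }

    // Placeholder for the real browse-table update.
    private void writeBrowseRows(String itemId) {
        indexedCount++;
    }

    // Private: an implementation detail, not reachable by callers.
    private void pruneIndexes() {
        pruneCount++;
    }

    int indexedCount() {
        return indexedCount;
    }

    int pruneCount() {
        return pruneCount;
    }
}

final class TransactionalIndexerDemo {
    public static void main(String[] args) {
        TransactionalIndexer indexer = new TransactionalIndexer();
        indexer.start();
        indexer.index("item-1");
        indexer.index("item-2");
        indexer.index("item-3");
        indexer.commit();
        System.out.println("items indexed: " + indexer.indexedCount());
        System.out.println("prune passes:  " + indexer.pruneCount());
    }
}
```

With this shape a batch importer would need no access to pruning at all, since
the single prune pass is a side effect of commit().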

> Batch import times increase drastically as repository size increases; patch 
> to mitigate the problem
> ---------------------------------------------------------------------------------------------------
>
>                 Key: DS-470
>                 URL: http://jira.dspace.org/jira/browse/DS-470
>             Project: DSpace 1.x
>          Issue Type: Improvement
>          Components: DSpace API
>    Affects Versions: 1.6.0
>            Reporter: Simon Brown
>            Priority: Minor
>         Attachments: batch_importer_speedup.patch
>
>
> As mentioned by my colleague Tom De Mulder on dspace-tech and at 
> http://tdm27.wordpress.com/2010/01/19/dspace-1-6-scalability-testing/ 
> the time taken for batch imports to run increases as the repository grows. 
> Having profiled the importer in our 1.6.0-RC1 install, we determined that 
> most (80%-90%) of the time was spent in calls to 
> IndexBrowse.pruneIndexes(). 
> The reason for this is that IndexBrowse.indexItem() calls pruneIndexes(), so 
> every time an item is indexed, the indexes are pruned. For any batch of size 
> n, where n > 1, this is (n - 1) times more than is necessary.
> Increasing the visibility of pruneIndexes(), removing the call from 
> IndexBrowse.indexItem(), and making a single call at the end of the 
> BrowseConsumer.end() method reduces this to once per event queue run.
> However, the batch importer calls Context.commit() after each item is 
> imported. Context.commit() runs the event queue, thus causing one event queue 
> run per imported item. 
> This patch addresses both of these issues in a way which has a minimal effect 
> on the rest of the code base; I don't necessarily consider it to be the 
> "best" way, but I wanted to keep the patch small so it could be put out. What 
> it does is:
> 1. create an IndexBrowse.indexItemNoPrune() method, which is called from the 
> BrowseConsumer class instead of indexItem(). Other calls to indexItem() are 
> not affected.
> 2. Call pruneIndexes() from BrowseConsumer.end()
> 3. Change the call in the batch importer from Context.commit() to 
> Context.getDBConnection().commit(). The only effective difference between the 
> two is that the event queue is not run; I think that a better solution might 
> be to move the code to run the event queue from the Context.commit() method 
> to the Context.complete() method, but I don't know what effect that will have 
> on the rest of the code.
> As noted in Tom's blog post linked above, these changes, on a repository with 
> in excess of 120,000 items, brought import time from 4.7 seconds/item down to 
> 4.9 items/second.
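
The arithmetic in the report above can be checked with a short sketch. The
class and method names are hypothetical and exist only for illustration; the
inputs (4.7 seconds/item before, 4.9 items/second after) are the figures quoted
in the report. It counts prune passes under the two strategies and converts the
two throughput figures into a single speedup factor.

```java
// Hypothetical helper for illustration; not part of DSpace.
final class PruneCostSketch {
    // Original behaviour: indexItem() prunes every time, so n items => n passes.
    static int prunePassesPerItem(int batchSize) {
        return batchSize;
    }

    // Patched behaviour: one pass at the end of a non-empty batch.
    static int prunePassesPerBatch(int batchSize) {
        return batchSize > 0 ? 1 : 0;
    }

    // Passes that do no useful work: (n - 1) for a batch of n > 1.
    static int redundantPrunePasses(int batchSize) {
        return prunePassesPerItem(batchSize) - prunePassesPerBatch(batchSize);
    }

    // Speedup = (seconds per item before) / (seconds per item after),
    // where seconds per item after = 1 / (items per second after).
    static double speedup(double secondsPerItemBefore, double itemsPerSecondAfter) {
        return secondsPerItemBefore * itemsPerSecondAfter;
    }

    public static void main(String[] args) {
        int batchSize = 1000; // arbitrary example size, not from the report
        System.out.println("redundant prune passes: " + redundantPrunePasses(batchSize));
        System.out.println("speedup factor: " + speedup(4.7, 4.9));
    }
}
```

Going from 4.7 seconds/item to 4.9 items/second works out to roughly a 23-fold
improvement in throughput.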

-- 
This message is automatically generated by JIRA.
-
If you think it was sent incorrectly contact one of the administrators: 
http://jira.dspace.org/jira/secure/Administrators.jspa
-
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

_______________________________________________
Dspace-devel mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/dspace-devel
