[
http://jira.dspace.org/jira/browse/DS-470?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=11105#action_11105
]
Graham Triggs commented on DS-470:
----------------------------------
I'm looking at the possibility of having the indexer determine whether it needs
to call pruneIndexes on a per item basis (in theory, it shouldn't need to if it
is only adding new data). Or alternatively, treat the indexing as a
transactional process, i.e. explicit start() / index() / index() / commit()
operations.
pruneIndexes() is very much an internal implementation detail of how the
indexer updates the browse tables, and as such it really should not be exposed
as part of the public API.
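
A minimal sketch of what the transactional shape could look like, purely for
discussion. The class name TransactionalIndexer, the isUpdate flag and the
prune counter are hypothetical and do not exist in IndexBrowse; the point is
only that pruning stays private and runs at most once per commit(), and not
at all when only new data was added.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch only; this is NOT the DSpace IndexBrowse API.
class TransactionalIndexer {
    private final List<Integer> pending = new ArrayList<>();
    private boolean active = false;
    private boolean needsPrune = false;
    private int pruneCount = 0; // instrumentation for the sketch

    // Begin a batch of indexing operations.
    void start() {
        if (active) {
            throw new IllegalStateException("indexing already started");
        }
        active = true;
        needsPrune = false;
        pending.clear();
    }

    // Queue one item. isUpdate is an assumed flag: true when the item
    // replaces existing browse rows, so stale entries may be left behind.
    void index(int itemId, boolean isUpdate) {
        if (!active) {
            throw new IllegalStateException("call start() first");
        }
        pending.add(itemId);
        needsPrune |= isUpdate;
    }

    // Finish the batch; prune once, and only if something was updated.
    void commit() {
        if (!active) {
            throw new IllegalStateException("call start() first");
        }
        if (needsPrune) {
            pruneIndexes();
        }
        pending.clear();
        active = false;
    }

    // Internal detail: not reachable from outside the indexer.
    private void pruneIndexes() {
        pruneCount++;
    }

    int getPruneCount() {
        return pruneCount;
    }
}
```

With this shape a caller never sees pruneIndexes() at all; whether the real
indexer can cheaply tell an add from an update is an open question.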
> Batch import times increase drastically as repository size increases; patch
> to mitigate the problem
> ---------------------------------------------------------------------------------------------------
>
> Key: DS-470
> URL: http://jira.dspace.org/jira/browse/DS-470
> Project: DSpace 1.x
> Issue Type: Improvement
> Components: DSpace API
> Affects Versions: 1.6.0
> Reporter: Simon Brown
> Priority: Minor
> Attachments: batch_importer_speedup.patch
>
>
> As my colleague Tom De Mulder mentioned on dspace-tech and at
> http://tdm27.wordpress.com/2010/01/19/dspace-1-6-scalability-testing/
> the time taken for batch imports to run increases as the repository
> grows. Having profiled the importer in our 1.6.0-RC1 install we
> determined that most (80%-90%) of the time was spent in calls to
> IndexBrowse.pruneIndexes().
> The reason for this is that IndexBrowse.indexItem() calls pruneIndexes(), so
> every time an item is indexed, the indexes are pruned. For any batch of size
> n, where n > 1, this is (n - 1) more prune calls than are necessary.
> Increasing the visibility of pruneIndexes(), removing the call from
> IndexBrowse.indexItem(), and making a single call at the end of the
> BrowseConsumer.end() method reduces this to once per event queue run.
> However, the batch importer calls Context.commit() after each item is
> imported. Context.commit() runs the event queue, thus causing one event queue
> run per imported item.
> This patch addresses both of these issues in a way which has a minimal effect
> on the rest of the code base; I don't necessarily consider it to be the
> "best" way, but I wanted to keep the patch small so it could be put out. What
> it does is:
> 1. create an IndexBrowse.indexItemNoPrune() method, which is called from the
> BrowseConsumer class instead of indexItem(). Other calls to indexItem() are
> not affected.
> 2. Call pruneIndexes() from BrowseConsumer.end()
> 3. Change the call in the batch importer from Context.commit() to
> Context.getDBConnection().commit(). The only effective difference between the
> two is that the event queue is not run; I think that a better solution might
> be to move the code to run the event queue from the Context.commit() method
> to the Context.complete() method, but I don't know what effect that will have
> on the rest of the code.
> As noted in Tom's blog post linked above, these changes, on a repository with
> in excess of 120,000 items, brought import time from 4.7 seconds/item down to
> 4.9 items/second.
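
For readers without the patch to hand, steps 1 and 2 of the quoted
description can be sketched roughly as below. BrowseIndexSketch and
BrowseConsumerSketch are hypothetical stand-ins, not the real IndexBrowse or
BrowseConsumer classes, and the counter exists only to show how many prune
calls each path makes.

```java
import java.util.LinkedHashSet;
import java.util.Set;

// Hypothetical stand-in for the browse indexer; not the DSpace class.
class BrowseIndexSketch {
    private int pruneCalls = 0;
    private int indexedItems = 0;

    // Original behaviour: every indexed item triggers a prune.
    void indexItem(int itemId) {
        indexItemNoPrune(itemId);
        pruneIndexes();
    }

    // Step 1 of the patch: index without pruning.
    void indexItemNoPrune(int itemId) {
        indexedItems++;
    }

    // Made visible to the consumer so it can be called once per run.
    void pruneIndexes() {
        pruneCalls++;
    }

    int getPruneCalls() {
        return pruneCalls;
    }

    int getIndexedItems() {
        return indexedItems;
    }
}

// Hypothetical stand-in for the event consumer.
class BrowseConsumerSketch {
    private final BrowseIndexSketch index;
    private final Set<Integer> toUpdate = new LinkedHashSet<>();

    BrowseConsumerSketch(BrowseIndexSketch index) {
        this.index = index;
    }

    // Record an item touched during this event queue run.
    void consume(int itemId) {
        toUpdate.add(itemId);
    }

    // Step 2 of the patch: index everything, then prune exactly once.
    void end() {
        for (int itemId : toUpdate) {
            index.indexItemNoPrune(itemId);
        }
        index.pruneIndexes();
        toUpdate.clear();
    }
}
```

Step 3 matters because this only helps if the event queue runs once per
batch; if it still runs once per imported item, end() is called n times and
the saving disappears.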
_______________________________________________
Dspace-devel mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/dspace-devel