[Dspace-devel] [DSJ] Updated: (DS-470) Batch import times increase drastically as repository size increases; patch to mitigate the problem

Tim Donohue (JIRA) Wed, 27 Jan 2010 13:36:29 -0800

     [ 
http://jira.dspace.org/jira/browse/DS-470?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]


Tim Donohue updated DS-470:
---------------------------

    Fix Version/s: 1.6.1

We discussed this issue and the proposed patch at today's DSpace Developers 
meeting.  See the IRC logs for the full discussion (search for DS-470):
http://www.duraspace.org/irclogs/index.php?date=2010-01-27

The general consensus was that this may need a bit more discussion about how 
best to resolve these issues.  It's good there's a patch available for folks 
encountering this problem. But there are some concerns as to the patch 
implementation, and whether there would be ways to make this pruning "smarter" 
in general.  

There's also a balance between "reducing load" and "reducing time".  From my 
understanding (someone correct me if I'm wrong), the importer is tailored to 
reduce the load on your system (so that people browsing the UI don't 
experience much of a slowdown), rather than to import items as quickly as 
possible. 

That said, most of us agree we need to rethink this and figure out a good 
resolution to the problem.  It's definitely not good to have item import time 
increase rapidly with the total number of items in the system (though some 
small increase in time may always occur).

So, for the time being, we're scheduling this for a potential 1.6.1 release -- 
to allow for more discussion and figure out a proper way to fix it.  In the 
meantime, the small patch that Simon attached (batch_importer_speedup.patch) 
is a way to fix this for those in desperate need of a quick fix.  Thanks, 
Simon, for locating this problem and finding a quick fix!

> Batch import times increase drastically as repository size increases; patch 
> to mitigate the problem
> ---------------------------------------------------------------------------------------------------
>
>                 Key: DS-470
>                 URL: http://jira.dspace.org/jira/browse/DS-470
>             Project: DSpace 1.x
>          Issue Type: Improvement
>          Components: DSpace API
>    Affects Versions: 1.6.0
>            Reporter: Simon Brown
>            Priority: Minor
>             Fix For: 1.6.1
>
>         Attachments: batch_importer_speedup.patch
>
>
> As my colleague Tom De Mulder mentioned on dspace-tech and at 
> http://tdm27.wordpress.com/2010/01/19/dspace-1-6-scalability-testing/ 
> the time taken for batch imports to run increases as the repository grows. 
> Having profiled the importer in our 1.6.0-RC1 install, we 
> determined that most (80%-90%) of the time was spent in calls to 
> IndexBrowse.pruneIndexes(). 
> The reason for this is that IndexBrowse.indexItem() calls pruneIndexes(), so 
> every time an item is indexed, the indexes are pruned. For any batch of size 
> n, where n > 1, this is (n - 1) more times than is necessary.
> Increasing the visibility of pruneIndexes(), removing the call from 
> IndexBrowse.indexItem(), and making a single call at the end of the 
> BrowseConsumer.end() method reduces this to once per event queue run.
> However, the batch importer calls Context.commit() after each item is 
> imported. Context.commit() runs the event queue, thus causing one event queue 
> run per imported item. 
> This patch addresses both of these issues in a way which has a minimal effect 
> on the rest of the code base; I don't necessarily consider it to be the 
> "best" way, but I wanted to keep the patch small so it could be put out. What 
> it does is:
> 1. create an IndexBrowse.indexItemNoPrune() method, which is called from the 
> BrowseConsumer class instead of indexItem(). Other calls to indexItem() are 
> not affected.
> 2. Call pruneIndexes() from BrowseConsumer.end()
> 3. Change the call in the batch importer from Context.commit() to 
> Context.getDBConnection().commit(). The only effective difference between the 
> two is that the event queue is not run; I think that a better solution might 
> be to move the code to run the event queue from the Context.commit() method 
> to the Context.complete() method, but I don't know what effect that will have 
> on the rest of the code.
> As noted in Tom's blog post linked above, these changes, on a repository with 
> in excess of 120,000 items, brought import time from 4.7 seconds/item down to 
> 4.9 items/second (about 0.2 seconds/item).
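
To make the counting argument in the description concrete, here is a minimal 
sketch in Java. It is a toy model, not the DSpace code or the attached patch: 
ToyIndexBrowse, ToyBrowseConsumer and PruneCountSketch are hypothetical 
stand-ins that borrow only the method names mentioned in the description, and 
"pruning" is reduced to incrementing a counter. The queueRunPerItem flag is 
likewise an assumption, standing in for whether the importer triggers one 
event queue run per imported item.

```java
import java.util.ArrayList;
import java.util.List;

/** Toy stand-in for IndexBrowse: counts how often the prune step runs. */
class ToyIndexBrowse {
    int pruneCalls = 0;
    final List<String> indexed = new ArrayList<>();

    // Behaviour described in the report: every indexed item also prunes.
    void indexItem(String item) {
        indexItemNoPrune(item);
        pruneIndexes();
    }

    // Change 1 in the description: index without pruning.
    void indexItemNoPrune(String item) {
        indexed.add(item);
    }

    // Reachable from the consumer, as after widening its visibility.
    void pruneIndexes() {
        pruneCalls++;
    }
}

/** Toy stand-in for BrowseConsumer: consume() per event, end() per queue run. */
class ToyBrowseConsumer {
    private final ToyIndexBrowse index;
    private final boolean patched;

    ToyBrowseConsumer(ToyIndexBrowse index, boolean patched) {
        this.index = index;
        this.patched = patched;
    }

    void consume(String item) {
        if (patched) {
            index.indexItemNoPrune(item);
        } else {
            index.indexItem(item);
        }
    }

    // Change 2 in the description: prune once when the queue run ends.
    void end() {
        if (patched) {
            index.pruneIndexes();
        }
    }
}

class PruneCountSketch {
    /** Returns how many times pruning ran while importing batchSize items. */
    static int pruneCalls(int batchSize, boolean patched,
                          boolean queueRunPerItem) {
        ToyIndexBrowse index = new ToyIndexBrowse();
        ToyBrowseConsumer consumer = new ToyBrowseConsumer(index, patched);
        for (int i = 0; i < batchSize; i++) {
            consumer.consume("item-" + i);
            if (queueRunPerItem) {
                consumer.end(); // one event queue run per imported item
            }
        }
        if (!queueRunPerItem) {
            consumer.end(); // a single event queue run for the whole batch
        }
        return index.pruneCalls;
    }

    public static void main(String[] args) {
        int n = 100;
        System.out.println("unpatched, queue run per item: "
                + pruneCalls(n, false, true));
        System.out.println("patched consumer, queue run per item: "
                + pruneCalls(n, true, true));
        System.out.println("patched consumer, one queue run: "
                + pruneCalls(n, true, false));
    }
}
```

In this model a batch of 100 items prunes 100 times before the change, still 
100 times if only the consumer is changed while the queue runs once per item, 
and once when the queue runs a single time for the batch. That is why the 
description treats the per-item Context.commit() as a second, separate issue.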

-- 
This message is automatically generated by JIRA.
-
If you think it was sent incorrectly contact one of the administrators: 
http://jira.dspace.org/jira/secure/Administrators.jspa
-
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

_______________________________________________
Dspace-devel mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/dspace-devel
