[Dspace-devel] [DSJ] Issue Comment Edited: (DS-470) Batch import times increase drastically as repository size increases; patch to mitigate the problem

Simon Brown (JIRA) Wed, 20 Jan 2010 10:11:14 -0800

    [ http://jira.dspace.org/jira/browse/DS-470?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=11090#action_11090 ]


Simon Brown edited comment on DS-470 at 1/20/10 6:09 PM:
---------------------------------------------------------

I have tried to keep the changes within the context of the event system as much 
as possible; any run of the event queue in which multiple items are indexed for 
browsing should only need to prune the browse indexes once, which is what one 
of the changes I made does.

There are a couple of problems with running index-update separately:

The first is that it's an additional process which needs to be launched 
manually.

The second is that, on a repository the size of our test repository (around 
125,000 items at the time of this post, or around 60% of the size of our live 
repository), index-update takes just under sixteen and a half minutes to run. 
That's roughly the same amount of time as the last batch of 4,000 items took to 
import with our patched batch importer. Even assuming that using the noindex 
dispatcher would shave some time off what our patched importer takes, the 
noindex importer + index-update would still take nearly twice as long as the 
batch importer with the patch. And the time taken to run index-update will 
scale more or less linearly with the number of items in the repository.
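
To make that comparison concrete, here is a rough back-of-the-envelope sketch. 
The 16.5 minute figures are rounded from the timings above; the fraction of 
time saved by the noindex dispatcher (savedFraction) is a made-up parameter, 
since I have not measured it, and the linear scaling is only the rough 
assumption stated above. The class and method names are purely illustrative.

```java
public class IndexUpdateEstimate {
    // Rounded from the figures above ("just under sixteen and a half minutes").
    static final double INDEX_UPDATE_MIN = 16.5;
    // The last patched batch of 4,000 items took roughly the same time.
    static final double PATCHED_BATCH_MIN = 16.5;

    // Total minutes for a noindex import followed by index-update.
    // savedFraction is a hypothetical guess at how much of the patched
    // import time the noindex dispatcher would save (0.0 to 1.0).
    static double noindexPlusUpdate(double savedFraction) {
        return PATCHED_BATCH_MIN * (1.0 - savedFraction) + INDEX_UPDATE_MIN;
    }

    // index-update time at another repository size, assuming linear scaling.
    // sizeRatio is target size divided by the test repository size.
    static double scaledIndexUpdate(double sizeRatio) {
        return INDEX_UPDATE_MIN * sizeRatio;
    }

    public static void main(String[] args) {
        double saved = 0.10; // assumption, not a measurement
        double total = noindexPlusUpdate(saved);
        System.out.printf("noindex + index-update: %.1f min (%.2fx patched)%n",
                total, total / PATCHED_BATCH_MIN);
        // The test repository is around 60% of the live one.
        System.out.printf("index-update at live size: %.1f min%n",
                scaledIndexUpdate(1.0 / 0.6));
    }
}
```

With a 10% saving the combined run comes out at about 1.9 times the patched 
importer, and under the linear assumption index-update alone would take around 
27.5 minutes at the size of our live repository.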

> Batch import times increase drastically as repository size increases; patch 
> to mitigate the problem
> ---------------------------------------------------------------------------------------------------
>
>                 Key: DS-470
>                 URL: http://jira.dspace.org/jira/browse/DS-470
>             Project: DSpace 1.x
>          Issue Type: Improvement
>          Components: DSpace API
>    Affects Versions: 1.6.0
>            Reporter: Simon Brown
>            Priority: Minor
>         Attachments: batch_importer_speedup.patch
>
>
> As mentioned by my colleague Tom De Mulder on dspace-tech and at 
> http://tdm27.wordpress.com/2010/01/19/dspace-1-6-scalability-testing/ 
> the time taken for batch imports to run increases as the repository grows. 
> increases. Having profiled the importer in our 1.6.0-RC1 install we 
> determined that most (80%-90%) of the time was spent in calls to 
> IndexBrowse.pruneIndexes(). 
> The reason for this is that IndexBrowse.indexItem() calls pruneIndexes(), so 
> every time an item is indexed, the indexes are pruned. For any batch of size 
> n, where n > 1, that is n prune calls where one would do, i.e. (n - 1) more 
> than necessary.
> Increasing the visibility of pruneIndexes(), removing the call from 
> IndexBrowse.indexItem(), and making a single call at the end of the 
> BrowseConsumer.end() method reduces this to once per event queue run.
> However, the batch importer calls Context.commit() after each item is 
> imported. Context.commit() runs the event queue, thus causing one event queue 
> run per imported item. 
> This patch addresses both of these issues in a way which has a minimal effect 
> on the rest of the code base; I don't necessarily consider it to be the 
> "best" way, but I wanted to keep the patch small so it could be put out. What 
> it does is:
> 1. create an IndexBrowse.indexItemNoPrune() method, which is called from the 
> BrowseConsumer class instead of indexItem(). Other calls to indexItem() are 
> not affected.
> 2. Call pruneIndexes() from BrowseConsumer.end()
> 3. Change the call in the batch importer from Context.commit() to 
> Context.getDBConnection().commit(). The only effective difference between the 
> two is that the event queue is not run; I think that a better solution might 
> be to move the code to run the event queue from the Context.commit() method 
> to the Context.complete() method, but I don't know what effect that will have 
> on the rest of the code.
> As noted in Tom's blog post linked above, on a repository with in excess of 
> 120,000 items these changes took imports from 4.7 seconds/item to 
> 4.9 items/second (about 0.2 seconds/item).
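
The three steps in the description above can be illustrated with a small 
self-contained sketch. It is not the patch itself and does not use the DSpace 
classes: Indexer and Consumer are hypothetical stand-ins for IndexBrowse and 
BrowseConsumer, and the only thing it models is how many times the indexes get 
pruned for a batch of n items.

```java
import java.util.ArrayList;
import java.util.List;

public class PruneOnceSketch {

    // Hypothetical stand-in for a browse indexer; NOT the real IndexBrowse API.
    static class Indexer {
        int pruneCalls = 0;
        final List<String> indexed = new ArrayList<>();

        // Index one item without pruning (step 1 in the description).
        void indexItemNoPrune(String itemId) { indexed.add(itemId); }

        // Stand-in for the expensive prune; here it only counts calls.
        void pruneIndexes() { pruneCalls++; }

        // Original behaviour: every indexed item triggers a prune.
        void indexItem(String itemId) {
            indexItemNoPrune(itemId);
            pruneIndexes();
        }
    }

    // Hypothetical stand-in for an event consumer with an end() hook.
    static class Consumer {
        private final Indexer indexer;
        private final List<String> pending = new ArrayList<>();

        Consumer(Indexer indexer) { this.indexer = indexer; }

        void consume(String itemId) { pending.add(itemId); }

        // Index everything queued, then prune once per run (step 2).
        void end() {
            for (String itemId : pending) {
                indexer.indexItemNoPrune(itemId);
            }
            pending.clear();
            indexer.pruneIndexes();
        }
    }

    // Prune calls when each item is indexed via indexItem().
    static int pruneCallsUnpatched(int n) {
        Indexer indexer = new Indexer();
        for (int i = 0; i < n; i++) {
            indexer.indexItem("item-" + i);
        }
        return indexer.pruneCalls;
    }

    // Prune calls when the whole batch goes through one consumer run.
    static int pruneCallsPatched(int n) {
        Indexer indexer = new Indexer();
        Consumer consumer = new Consumer(indexer);
        for (int i = 0; i < n; i++) {
            consumer.consume("item-" + i);
        }
        consumer.end();
        return indexer.pruneCalls;
    }

    public static void main(String[] args) {
        int n = 4000;
        System.out.println("unpatched prune calls: " + pruneCallsUnpatched(n));
        System.out.println("patched prune calls:   " + pruneCallsPatched(n));
    }
}
```

Note that the saving only appears if end() runs once for the whole batch. If 
the event queue is run after every item, as happens when the importer calls 
Context.commit() per item, the consumer still prunes n times, which is why the 
third step is needed as well.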
