Apologies, it was my fault that the “load” vs “speed” issue came up. I think I 
misquoted Graham there, though it did seem like that’s what much of the 
discussion was initially about. The issue came up in a JIRA review at today’s 
developer meeting, which is why we were commenting on it… I guess I should have 
kept my mouth shut :P

 

From: Graham Triggs [mailto:grahamtri...@gmail.com] 
Sent: Thursday, 18 February 2010 1:03 p.m.
To: Tom De Mulder
Cc: Tim Donohue (JIRA); dspace-devel@lists.sourceforge.net
Subject: Re: [Dspace-devel] [DSJ] Commented: (DS-470) Batch import times 
increase drastically as repository size increases; patch to mitigate the 
problem

 

Hi Tom,

 

On 17 February 2010 22:09, Tom De Mulder <td...@cam.ac.uk> wrote:

I'd like to point out that this has never been substantiated, and that we
have so far made clear that system load goes UP at the same time as these
batches SLOW DOWN. I don't know where you get this idea from that speeding
up batch times would negatively affect overall system performance.

 

Can I clarify that I never stated that this particular case increased system 
load?

 

I had (days ago) made the general point that making a query run faster can also 
make it consume more resources, so it degrades much more sharply as the dataset 
grows (which is almost what happened here: this particular query had originally 
been tuned on Oracle to reduce the resources it takes, and then was blindly 
converted to Postgres, where it appeared OK initially but suffers with more 
data on an older or unoptimized Postgres instance).

 

But that's a general point, not specifically about this issue; it was made to 
argue that if you want to demonstrate that this is a SCALABILITY improvement, 
then you have to provide more than just the execution time. Elapsed time is 
performance, and performance is NOT scalability. You may often improve 
performance and scalability simultaneously, but they don't always go hand in 
hand.
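A toy Python illustration of this distinction (not DSpace code - the names and data here are purely made up): two functions that produce identical results and take similar elapsed time on small inputs, yet one does quadratic work, so its cost grows far faster as the dataset grows. Measuring elapsed time at one size tells you about performance, not scalability.

```python
def unique_quadratic(values):
    """O(n^2): each membership test scans the whole list so far."""
    seen = []
    for v in values:
        if v not in seen:      # linear scan per item
            seen.append(v)
    return seen

def unique_linear(values):
    """O(n): hash-based membership test, analogous to a hash
    operation replacing a sort in a query plan."""
    seen, out = set(), []
    for v in values:
        if v not in seen:      # constant-time hash lookup
            seen.add(v)
            out.append(v)
    return out
```

On a handful of items either version looks "fast"; only the growth rate separates them.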

 

So, I'm not saying that this patch does increase system load. However, I do 
have scalability concerns with how this patch is implemented; specifically, how 
many items can be batch imported in one execution? In theory, the existing 
importer could load an unlimited number of items, whereas this modification 
WILL run out of memory after a finite number of items. How many will depend on 
the size of the metadata.

 

If you want to deal with an arbitrary size of batch import (as well as 
importing into a large repository), then you are better served following 
Richard's suggestion: simply disable the indexing during the batch, and rebuild 
at the end. That is more overall load than your modification, but it has more 
general suitability (it shouldn't limit the number of items that you can 
process in a single run).
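A minimal sketch of that suggestion, with entirely hypothetical names (these are not real DSpace APIs): per-item index maintenance is switched off for the duration of the batch, then the index is rebuilt once. Because items are consumed one at a time from an iterable and never accumulated, memory use stays flat regardless of batch size.

```python
def batch_import(items, repo, indexer):
    """Ingest items with indexing disabled, then rebuild once.

    'items' can be any iterable (including a lazy stream), so the
    batch size is not bounded by memory."""
    indexer.disable()              # no index updates per item
    try:
        for item in items:
            repo.ingest(item)
    finally:
        indexer.enable()           # restore indexing even on failure
    indexer.rebuild()              # one full (re)index at the end
```

The trade-off is exactly the one described above: a full rebuild is more total work than incremental maintenance, but the import itself no longer scales with repository or batch size.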

 

        The best way to reduce system impact here is to reduce the number of
        times the indexes get pruned to 1 from N, rather than to still do them
        (N-1) times too many but slightly faster.

 

I quite agree that it would be good to reduce the number of times the indexes 
are pruned in a batch import, which is why I voted +1 for resolving this 
post-1.6. And given the potential memory cost of holding all those items in 
memory, I want to modify the browse code so that we can do an incremental 
re-index - which would mean that you can import all your items without 
indexing, and then at the end index just the new (or changed) items, with a 
single prune at the end.
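A rough sketch of that incremental re-index idea (again with hypothetical names, not the actual browse code): the import records which item IDs it touched, indexes only those at the end, and prunes once. Unlike a full rebuild, the indexing work scales with the batch, not the repository.

```python
def import_with_incremental_index(items, repo, index):
    """Ingest without indexing, then index only the touched items
    and prune the indexes a single time."""
    touched = []
    for item in items:
        repo.ingest(item)
        touched.append(item.id)    # remember what changed; don't index yet
    for item_id in touched:
        index.index_item(item_id)  # index only new/changed items
    index.prune()                  # single prune at the end
```

Note that keeping only the IDs (not whole items with metadata) in memory keeps the footprint small even for large batches.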

 

But what I was demonstrating was not how to make those queries slightly 
faster, but how to make them more efficient - hash operations instead of 
sorts, a few sequential scans instead of many loops of index scans. It's not 
about how fast they are, but about understanding how they are executed and 
what impact that has on the system. And, in doing so, understanding how to 
install and configure the database so that the most efficient execution plans 
are used.
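For example, Postgres configuration directly influences which plans the planner will choose. The values below are illustrative only, not recommendations; tune them to the machine, and re-ANALYZE after large data loads so the planner's statistics reflect reality:

```
# Illustrative postgresql.conf fragment (example values, not advice):
work_mem = 16MB                # memory per sort/hash operation; too small
                               # forces on-disk sorts and discourages
                               # hash-based plans
effective_cache_size = 1GB     # hint to the planner about how much of the
                               # database the OS is likely caching
```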

 

Because by doing that, we aren't just improving the batch importer. We're 
improving the ingestion of new items via SWORD, and the creation of new items 
via the UI. And we're probably improving general user operations - like 
browsing items by an author (and/or restricted to a particular collection), 
which involves joins and will be more efficient using hash operations rather 
than sorts.

 

G
