Re: [Dspace-tech] DSpace 1.4.2 Browse performance issues

Thornton, Susan M. (LARC-B702)[NCI INFORMATION SYSTEMS] Mon, 11 Aug 2008 12:57:29 -0700

Hi Graham,
     Just wanted to let you know I took your advice - I created a
materialized view and changed Browse.java to access the new view instead
of itemsbyauthor (in the case of browse-items-by-author) and our
response times went from 25+ seconds to sub-second response times.
Thanks a bunch Graham!!!
Best,
Sue


-----Original Message-----
From: Graham Triggs [mailto:[EMAIL PROTECTED] 
Sent: Friday, August 08, 2008 3:11 AM
To: Thornton, Susan M. (LARC-B702)[NCI INFORMATION SYSTEMS]
Cc: dspace-tech@lists.sourceforge.net; Smail, James W. (LARC-B702)[NCI
INFORMATION SYSTEMS]
Subject: Re: [Dspace-tech] DSpace 1.4.2 Browse performance issues

Sue,

You are right both about the performance of browse in 1.4.2, and the 
problems that you will have trying to limit the number of rows 
considered by the query.

The real problem with the query is the use of the DISTINCT view (note 
that the 'where sort_author' part outside of the view in your rewritten 
query shouldn't make any difference - the improvement is all coming from

restricting the number of rows that are considered in the DISTINCT
query).

Now, the good news as far as DSpace is concerned is that all of this has

been completely replaced in 1.5, and retrieving the author list does not

use similar DISTINCT queries.

I don't know what amount of customisations you have, but it's likely 
that upgrading to 1.5 would be the path of least resistance, and you'll 
gain a number of additional new features.

If you must stick with 1.4.2, then I would suggest that you create a 
'materialized view' in postgres that just contains the distinct rows 
from ItemsByAuthor, and replace the '(SELECT DISTINCT * from 
ItemsByAuthor) distinct_view' with the materialized view.

This page:

http://jonathangardner.net/tech/w/PostgreSQL/Materialized_Views

contains information on how you can create a materialized view / table 
using triggers (probably shouldn't use the entirely general purpose 
triggers presented there - make them specific to handling the 
ItemsByAuthor table, and ensure that you have an index on sort_author on

the created table).

G

Thornton, Susan M. (LARC-B702)[NCI INFORMATION SYSTEMS] wrote:
>
> We are experiencing very poor performance with our DSpace Browses 
> (DSpace 1.4.2/postgreSQL 8.2/Solaris 10). I've spent the past day 
> looking at code and experimenting with queries and explains and I've 
> got a good idea of a sql query modification I could make to 
> drastically improve performance - I'm not just quite sure how to 
> accomplish it without disrupting other DSpace code and functionality. 
> Here's an example:
>
> I've been looking at the */itemsbyauthor /*table. Ours contains 94,401

> rows. Here is the DSpace query that clicking on the 
> browse-items-by-author link generates:
>
> *select distinct author, sort_author*
>
> * from (SELECT DISTINCT * from ItemsByAuthor ) distinct_view*
>
> * order by sort_author LIMIT 21*
>
> * *
>
> and here is the Explain from executing the query against our database:
>
> "Limit (cost=28209.73..28211.30 rows=21 width=64) (actual 
> time=8382.920..8384.061 rows=21 loops=1)"
>
> " -> Unique (cost=28209.73..28917.73 rows=9441 width=64) (actual 
> time=8382.915..8384.045 rows=21 loops=1)"
>
> " -> Sort (cost=28209.73..28445.73 rows=94401 width=64) (actual 
> time=8382.910..8383.992 rows=56 loops=1)"
>
> " Sort Key: sort_author, author"
>
> " -> Subquery Scan distinct_view (cost=14410.61..16534.63 rows=94401 
> width=64) (actual time=385.620..671.460 rows=94401 loops=1)"
>
> " -> Unique (cost=14410.61..15590.62 rows=94401 width=42) (actual 
> time=385.611..603.050 rows=94401 loops=1)"
>
> " -> Sort (cost=14410.61..14646.61 rows=94401 width=42) (actual 
> time=385.605..504.765 rows=94401 loops=1)"
>
> " Sort Key: items_by_author_id, item_id, author, sort_author"
>
> " -> Seq Scan on itemsbyauthor (cost=0.00..3383.01 rows=94401 
> width=42) (actual time=19.232..113.700 rows=94401 loops=1)"
>
> "Total runtime: 8384.581 ms"
>
> Basically, a Sequential Scan is being done against the itemsbyauthor 
> table and the index is not being used. (By the way, the response time 
> was around 25 seconds last night prior to running a vacuum and 
> reindex! - Terrible!!)
>
> Here's my idea: if I could somehow capture an alphabetic range to use 
> in a where clause, based on what the User clicks on, I could add a 
> "where" clause to the query which would force postgreSQL to use the 
> index, "sort_author_idx", and the elapsed time that it would take to 
> execute the query would be a fraction of what it's taking now. For 
> instance, if I knew the User was starting the browse from the letter 
> "B" (I know this is oversimplifying, but you'll get the idea), I could

> structure the query as follows:
>
> *select distinct author,sort_author from (SELECT DISTINCT * from 
> ItemsByAuthor where sort_author > 'a' and sort_author < 'c' ) 
> distinct_view *
>
> *where sort_author > 'a' and sort_author < 'c'*
>
> *order by sort_author LIMIT 21*
>
> Look at the explain from this query, below and just look at the 
> difference in performance! From 8.4 seconds to 1.3 seconds!!
>
> "Limit (cost=6544.60..6546.17 rows=21 width=64) (actual 
> time=1253.393..1253.710 rows=21 loops=1)"
>
> " -> Unique (cost=6544.60..6650.94 rows=1418 width=64) (actual 
> time=1253.388..1253.696 rows=21 loops=1)"
>
> " -> Sort (cost=6544.60..6580.04 rows=14179 width=64) (actual 
> time=1253.382..1253.641 rows=56 loops=1)"
>
> " Sort Key: sort_author, author"
>
> " -> Subquery Scan distinct_view (cost=4663.32..4982.35 rows=14179 
> width=64) (actual time=113.859..145.529 rows=11463 loops=1)"
>
> " -> Unique (cost=4663.32..4840.56 rows=14179 width=42) (actual 
> time=113.852..137.697 rows=11463 loops=1)"
>
> " -> Sort (cost=4663.32..4698.77 rows=14179 width=42) (actual 
> time=113.848..126.454 rows=11463 loops=1)"
>
> " Sort Key: items_by_author_id, item_id, author, sort_author"
>
> " -> Bitmap Heap Scan on itemsbyauthor (cost=476.50..3199.08 
> rows=14179 width=42) (actual time=71.693..80.927 rows=11463 loops=1)"
>
> " Recheck Cond: ((sort_author > 'a'::text) AND (sort_author < 
> 'c'::text) AND (sort_author > 'a'::text) AND (sort_author <
'c'::text))"
>
> " -> Bitmap Index Scan on sort_author_idx (cost=0.00..472.95 
> rows=14179 width=0) (actual time=71.078..71.078 rows=11463 loops=1)"
>
> " Index Cond: ((sort_author > 'a'::text) AND (sort_author < 'c'::text)

> AND (sort_author > 'a'::text) AND (sort_author < 'c'::text))"
>
> "Total runtime: 1254.158 ms"
>
> Here you see an index scan is used as opposed to the sequential scan 
> and, of course, it's much faster.
>
> Basically, what I'm trying to figure out is:
>
>    1. Where in the code to make the modification (jsp? One of the
>       browse programs?)
>    2. How to determine how to structure the "where" clause based on
>       what the user clicks on
>          1. Jump to?
>          2. Enter the first few letters?
>          3. Next page?
>          4. Previous page?
>
> If any of the developers (or anyone else who knows DSpace code inside 
> and out) are reading this and could help me out with this, I would 
> really appreciate it.
>
> Thanks in advance.
>
> Best,
>
> Sue
>
> Sue Walker-Thornton
>
> ConITS Contract
> NASA Langley Research Center
> Integrated Library Systems Application/Database Administrator
>
> 130 Research Drive
>
> Hampton, VA 23666
>
> Office: (757) 224-4074
> Fax: (757) 224-4001
> Pager: (757) 988-2547
> Email: [EMAIL PROTECTED] <mailto:[EMAIL PROTECTED]>
>
>
------------------------------------------------------------------------
>
>
------------------------------------------------------------------------
-
> This SF.Net email is sponsored by the Moblin Your Move Developer's
challenge
> Build the coolest Linux based applications with Moblin SDK&  win great
prizes
> Grand prize is a trip for two to an Open Source event anywhere in the
world
> http://moblin-contest.org/redirect.php?banner_id=100&url=/
>
------------------------------------------------------------------------
>
> _______________________________________________
> DSpace-tech mailing list
> DSpace-tech@lists.sourceforge.net
> https://lists.sourceforge.net/lists/listinfo/dspace-tech
>    

This email has been scanned by Postini.
For more information please visit http://www.postini.com


-------------------------------------------------------------------------
This SF.Net email is sponsored by the Moblin Your Move Developer's challenge
Build the coolest Linux based applications with Moblin SDK & win great prizes
Grand prize is a trip for two to an Open Source event anywhere in the world
http://moblin-contest.org/redirect.php?banner_id=100&url=/
_______________________________________________
DSpace-tech mailing list
DSpace-tech@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/dspace-tech

Re: [Dspace-tech] DSpace 1.4.2 Browse performance issues

Reply via email to