On Wed, Jun 17, 2009 at 03:35:14PM +0100, Andrew Marlow wrote:
> I am working on a proprietary digital library whilst at the same time
> considering how dspace might have been used to solve the same problems (it
> won't be, but that's another story). When I consider usage event reporting
> there are some concerns that arise when the number of articles and visitors
> is very large. The current site has around 6 million articles and roughly 20
> million hits per day. With these sorts of volumes, weblogs are IMHO not the
> way to go. Also there are problems of scale when using event logs. Basically
> a file-based approach is only suitable for small volumes of data.

Absolutely.  The event sink classes that come with DSpace are samples,
not meant for serious production use.  The class we use here is built
around an RDBMS.  (But we haven't even dreamed of 20 million hits/day.)

> I considered using an RDBMS and this does get you further but
> unfortunately, not far enough. An RDBMS can cope with millions of rows
> but starts to struggle when you reach tens of millions or hundreds of
> millions. Let's do some maths. In these calculations, there is a
> requirement to produce year to date (YTD) figures (this is a
> requirement of COUNTER). I will assume that the RDBMS will calculate
> the YTD, rather than store a running total when the current month is
> processed. This means that figures for 12 months need to be retained.
> Of those 20 million hits, some will be for the same article(s). So
> after article-level aggregation has been performed there will be a
> maximum of 6 million rows for one day. That is 180 million for one
> month, and 2,160 million for 12 months. Now 2 billion rows seems a bit
> on the large side to me :-)

OLTP systems can cope with billions of rows per day while giving
reasonable performance, but your requirements may be quite different.
They probably journal that traffic immediately and post it into tables
later, like the guy with the green eyeshade used to do on paper.
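
A minimal sketch of that journal-then-post pattern, using Python's
sqlite3 purely for illustration -- the table and column names
(hit_journal, daily_counts) are invented for this example, not anything
from DSpace's schema:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()

# Raw hits are appended to a narrow journal table as fast as possible;
# no aggregation happens on the request path.
cur.execute("CREATE TABLE hit_journal (article_id INTEGER, hit_day TEXT)")

# Simulate a burst of incoming hits.
hits = [(1, "2009-06-17"), (1, "2009-06-17"), (2, "2009-06-17")]
cur.executemany("INSERT INTO hit_journal VALUES (?, ?)", hits)

# Later, a batch job "posts" the journal into an aggregated ledger table
# and clears the journal -- the green-eyeshade step, run off-peak.
cur.execute(
    "CREATE TABLE daily_counts (article_id INTEGER, hit_day TEXT, hits INTEGER)"
)
cur.execute("""
    INSERT INTO daily_counts
    SELECT article_id, hit_day, COUNT(*)
    FROM hit_journal
    GROUP BY article_id, hit_day
""")
cur.execute("DELETE FROM hit_journal")

print(cur.execute(
    "SELECT article_id, hits FROM daily_counts ORDER BY article_id"
).fetchall())
# [(1, 2), (2, 1)]
```

The point is only the shape of the flow: cheap appends on the hot path,
aggregation deferred to a batch job.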

> One way around this would be to have a table for each month. Thus a table
> might have to cope with 180 million rows, which is manageable, even though it
> is large. But in calculating the YTD figures one would need to do a 12 table
> join. That's a bit unwieldy.

How quickly do you have to do those calculations?  I've been given to
understand that another requirement of COUNTER is auditing, so we are
not talking subsecond response times here. :-)

Once a body of data (say, last month, or even yesterday) can be
considered static, it can be rolled up and the sums dropped into much
smaller summary tables for quick grand-totalling.  The rollup jobs can
run at low priority whenever it's convenient.  Six million detail
rows for today, plus up to eleven monthly summary rows, plus up to
thirty daily ones for the current month, doesn't sound quite so
daunting.  In capturing high-volume data you need to winkle out every
opportunity to aggregate.
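
To make the rollup idea concrete, here is a sketch in Python's sqlite3
-- schema and numbers are invented for illustration (monthly_summary,
detail_current), not a real COUNTER implementation.  YTD becomes one
sum over a tiny summary table plus the live month, instead of a
12-table join over detail:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()

# One detail table for the open month, plus a small summary table
# holding one pre-computed total per article per closed month.
cur.execute("CREATE TABLE detail_current (article_id INTEGER, hits INTEGER)")
cur.execute(
    "CREATE TABLE monthly_summary (month TEXT, article_id INTEGER, hits INTEGER)"
)

# Closed months were rolled up once and are now static.
for month in ("2009-01", "2009-02", "2009-03"):
    cur.execute("INSERT INTO monthly_summary VALUES (?, 42, 100)", (month,))

# The open month still accumulates detail rows (50 single hits here).
cur.executemany("INSERT INTO detail_current VALUES (42, ?)", [(1,)] * 50)

# YTD for article 42 = closed-month summaries + live detail.
ytd = cur.execute("""
    SELECT (SELECT COALESCE(SUM(hits), 0)
            FROM monthly_summary WHERE article_id = 42)
         + (SELECT COALESCE(SUM(hits), 0)
            FROM detail_current WHERE article_id = 42)
""").fetchone()[0]
print(ytd)  # 350
```

The summary table grows by at most one row per article per month, so
the grand-totalling query stays small no matter how large the raw hit
volume gets.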

Have you benchmarked totalling a 2-giga-row column?  How long did it
take to sum it?  How long did it take to add the final row?  What did
your system load monitoring tools show?  If you include a column of
texts randomly selected from a fixed set, how well does it perform on
subsets of realistic size?  Are you better off indexing this "title"
column (and paying the price for index maintenance with each INSERT)
or not (and doing sequential scans instead of index scans)?  Does it
help to place the "titles" in a moderate-sized lookaside table, look
them up, and use the corresponding serial number in detail rows?  Can
you improve performance with preallocated storage?  Can you win by
splitting tables up by columns (not forgetting their indices) and
putting them in multiple tablespaces on multiple drives?  Do you need
to move indices to an SSD?  (PostgreSQL can't split a single table
across tablespaces, but can place indices and tables individually.)
Do you
really need to go this far?
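
The lookaside-table idea sketched above looks roughly like this in
Python's sqlite3 -- again, the names (title_lookup, detail) and the
insert-if-absent helper are hypothetical, just to show the shape:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()

# Each distinct title is stored once in a small lookaside table;
# detail rows carry only its integer serial number.
cur.execute(
    "CREATE TABLE title_lookup (id INTEGER PRIMARY KEY, title TEXT UNIQUE)"
)
cur.execute("CREATE TABLE detail (title_id INTEGER, hit_day TEXT)")

def title_id(title):
    # Insert-if-absent, then fetch the serial number.
    cur.execute("INSERT OR IGNORE INTO title_lookup (title) VALUES (?)", (title,))
    return cur.execute(
        "SELECT id FROM title_lookup WHERE title = ?", (title,)
    ).fetchone()[0]

for t in ("Alpha", "Beta", "Alpha", "Alpha"):
    cur.execute("INSERT INTO detail VALUES (?, ?)", (title_id(t), "2009-06-17"))

# Detail rows stay narrow; the join back to titles touches a tiny table.
rows = cur.execute("""
    SELECT l.title, COUNT(*)
    FROM detail d JOIN title_lookup l ON l.id = d.title_id
    GROUP BY l.title ORDER BY l.title
""").fetchall()
print(rows)  # [('Alpha', 3), ('Beta', 1)]
```

Whether this beats an indexed text column in the detail table is
exactly the sort of question only a benchmark on realistic volumes can
answer.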

In high-performance systems you really have to try different ideas and
compare their real-world performance -- theory gets you only so far.

-- 
Mark H. Wood, Lead System Programmer   mw...@iupui.edu
Friends don't let friends publish revisable-form documents.


_______________________________________________
DSpace-tech mailing list
DSpace-tech@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/dspace-tech
