> We thought about having one massive file for all this data, BUT, we
> archive something like 100,000 articles a day.  We store articles for 6
> months, so this table would very quickly have 18,000,000 rows.  Now, we
> need to select a group of articles based on source and date.  We also
> need to select based on url.

Shouldn't be a problem. You can put the whole thing in one table with
indexes on the various columns you'll be selecting on.
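As a rough sketch of that single-table approach (the table and column names
here are just guesses at your schema, not something you've posted):

  CREATE TABLE articles (
      filename  VARCHAR(255) NOT NULL,
      source    VARCHAR(100) NOT NULL,
      pub_date  DATE         NOT NULL,
      url       VARCHAR(255) NOT NULL,
      INDEX (source, pub_date),
      INDEX (url)
  );

A select by source and date range then runs off the (source, pub_date)
index, and a select by url runs off the url index.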

In my experience it's best to tokenise the source, date and url information
in a separate table, thus storing just the filename and three integers (see
the sketch after the list below).

1st integer - Days since 1st Jan 1980 (for example). If you don't need to
store the full datetime, don't bother; just store the day count as an
integer.
2nd integer - A reference to the source. Store the actual source name in
another table, with this reference number against it (indexed, of course).
3rd integer - Same as #2, but a reference number for the url, with the url
itself held elsewhere.
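Roughly, using made-up names (big_table, source_table and url_table are just
placeholders for whatever you call them):

  CREATE TABLE big_table (
      filename   VARCHAR(255) NOT NULL,
      day_num    INT NOT NULL,       -- days since 1st Jan 1980
      source_id  INT NOT NULL,       -- reference into source_table
      url_id     INT NOT NULL        -- reference into url_table
  );

  CREATE TABLE source_table (
      source_id  INT NOT NULL AUTO_INCREMENT PRIMARY KEY,
      source     VARCHAR(100) NOT NULL,
      INDEX (source)
  );

  CREATE TABLE url_table (
      url_id     INT NOT NULL AUTO_INCREMENT PRIMARY KEY,
      url        VARCHAR(255) NOT NULL,
      INDEX (url)                    -- or just a prefix of it, see below
  );

The INTs here are just a starting point; pick the integer size you actually
need, as noted below.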

This keeps the indexes you look up on nice and small, as they only hold
three integer values (make sure you use the right size of integer - BIGINT,
probably).

If you always look up on all three values, a single index across all three
is best.
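For example, with the placeholder names above:

  ALTER TABLE big_table
      ADD INDEX idx_day_source_url (day_num, source_id, url_id);

Column order matters: MySQL can also use the leftmost columns of this index
on their own, so put the column you filter on most often first.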

To look up articles based on url, you check the url_table for the matching
reference number, then run a query on the big_table looking for that
reference number.
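In SQL terms (the url and the returned id are made up, of course):

  -- step 1: find the reference number for the url
  SELECT url_id FROM url_table
  WHERE url = 'http://www.example.com/some/article.html';

  -- step 2: pull the matching articles out of the big table
  SELECT filename FROM big_table
  WHERE url_id = 42;    -- the number returned by step 1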

If you have duplicate urls, only store them once in the url_table, as it's
just wasteful otherwise. This also makes for a faster url lookup.
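One simple way to do that (sticking with the placeholder schema) is to look
the url up before inserting it:

  -- does this url already have a reference number?
  SELECT url_id FROM url_table
  WHERE url = 'http://www.example.com/some/article.html';

  -- only if that returned nothing:
  INSERT INTO url_table (url)
  VALUES ('http://www.example.com/some/article.html');

Either way you end up with one url_id per distinct url, which is what the
big_table rows point at.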

> I can only imagine how long it would take to search 18,000,000 rows for a
> specific url.

Checking the larger table with 18M rows would typically come back in well
under a second on a 200MHz PC, although you need around 80M of index cache.
If all the urls are unique, the url table will become pretty large, and
you'll need a fair-sized index on that too. I would imagine 300M extra would
do the trick, if you only index the first 14 characters or so.
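Indexing only the first 14 characters is a MySQL prefix index, something
like:

  -- index only the first 14 characters of each url
  ALTER TABLE url_table ADD INDEX idx_url_prefix (url(14));

Lookups still say WHERE url = '...'; MySQL uses the prefix to narrow the
candidates and then checks the full value in the row, so you trade a little
CPU for a much smaller index.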

These figures scale up pretty well, until you run out of memory, when it all
gets a bit sluggish. If that's the case, you can store the different tables
on separate PCs to speed up retrieval. After all -- you aren't doing any
JOINs.

Where it gets interesting is when you get a request to search all of these
documents a la dejanews. It's actually not too difficult to build a very
fast word search algorithm, but you'll start to deal with proper BIG tables,
with billions of rows...
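For the curious, the usual shape of that is an inverted index: one row per
(word, article) pair, which is where the billions of rows come from. A
minimal sketch, assuming you also give each article a numeric id (none of
these names come from the original post):

  CREATE TABLE word_table (
      word_id  INT NOT NULL AUTO_INCREMENT PRIMARY KEY,
      word     VARCHAR(50) NOT NULL,
      INDEX (word)
  );

  CREATE TABLE word_index (
      word_id     INT NOT NULL,
      article_id  INT NOT NULL,
      INDEX (word_id)
  );

  -- articles containing a given word (word_id looked up in word_table first)
  SELECT article_id FROM word_index WHERE word_id = 123;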

You also need to factor in how reliable you want the system to be, and what
your acceptable down-time is. BTW, reloading tables of over 1M records can
be a pain in the arse when you need to recover, and there's no easy way to
make it quicker.

Russ.



