> We thought about having one massive file for all this data, BUT, we archive
> something like 100,000 articles a day. We store articles for 6 months, so
> this table would very quickly have 18,000,000 rows. Now, we need to select
> a group of articles based on source and date. We also need to select based
> on url.
Shouldn't be a problem. You can put the whole thing in one table with indexes on the various columns you'll be selecting on. In my experience it's best to tokenise the source, date and url information in a separate table, so the main table stores just the filename and three integers:

1st integer - Days since 1st Jan 1980 (for example). If you don't need to store the full datetime field, then don't bother; just store an integer day number.

2nd integer - A reference to the source. You store the actual source name in another table, with this reference number against it (indexed, of course).

3rd integer - Same as #2, but a reference number for the url, the url itself being held elsewhere.

This keeps the indexes you look up on nice and small, as they hold only three integer values (make sure you use the correct type of integer - BIGINT, probably). If you always look up on all three values, a single index over all three is best.

To look up articles based on url, you check the url_table for the correct reference number, then run a query on the big_table looking for that reference number. If you have duplicate urls, only store each one once in the url_table; anything else is just wasteful, and de-duplicating also makes for a faster url lookup.

> I can only imagine how long it would take to search 18,000,000 rows for a
> specific url.

Checking the larger table with 18M rows would typically come back in well under a second on a 200MHz PC, although you need around 80M of index cache. If all the urls are unique, the url table will become pretty large, and you'll need a fair-sized index on that too - I would imagine an extra 300M would do the trick if you only index the first 14 characters or so.

These figures scale up pretty well until you run out of memory, at which point it all gets a bit sluggish. If that happens, you can store the different tables on separate PCs to speed up retrieval - after all, you aren't doing any JOINs.

Where it gets interesting is when you get a request to search all of these documents a la dejanews.
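To make the layout above concrete, here's a rough sketch of what the tables might look like. All the table and column names (big_table, url_table, etc.) are illustrative guesses at what was described, not something from the thread, and column sizes are arbitrary:

```sql
-- Lookup table for sources: name stored once, referenced by number.
CREATE TABLE source_table (
    source_id   INT NOT NULL AUTO_INCREMENT PRIMARY KEY,
    source_name VARCHAR(100) NOT NULL,
    INDEX (source_name)
);

-- Lookup table for urls: each url stored once.
-- Index only the first 14 characters, as suggested, to keep it small.
CREATE TABLE url_table (
    url_id INT NOT NULL AUTO_INCREMENT PRIMARY KEY,
    url    VARCHAR(255) NOT NULL,
    INDEX (url(14))
);

-- The big table: filename plus three integers.
CREATE TABLE big_table (
    filename   VARCHAR(100) NOT NULL,
    day_number INT NOT NULL,   -- days since 1st Jan 1980
    source_id  INT NOT NULL,   -- reference into source_table
    url_id     INT NOT NULL,   -- reference into url_table
    INDEX (source_id, day_number),  -- source + date selects
    INDEX (url_id)                  -- url selects
);

-- The two-step url lookup described above:
-- step 1, find the reference number for the url...
SELECT url_id FROM url_table WHERE url = 'http://example.com/some-article';

-- ...step 2, pull the matching articles from the big table
-- (42 standing in for whatever url_id step 1 returned).
SELECT filename FROM big_table WHERE url_id = 42;
```

Note the lookup is deliberately two separate queries rather than a JOIN, which is what lets the tables live on separate machines if memory gets tight.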
It's actually not too difficult to build a very fast word-search algorithm, but you'll start to deal with properly BIG tables, with billions of rows... You also need to factor in how reliable you want the system to be, and what your acceptable downtime is.

BTW - loading tables of over 1M records can be a pain in the arse when you need to recover, and there's no easy way to make it quicker.

Russ.
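As a sketch of where the billions of rows in a word search come from: the usual structure (my illustrative names, not something specified in the thread) is an inverted index - one table holding each distinct word once, and one huge table mapping word references to article references:

```sql
-- Each distinct word stored once, with a reference number.
CREATE TABLE word_table (
    word_id INT NOT NULL AUTO_INCREMENT PRIMARY KEY,
    word    VARCHAR(50) NOT NULL,
    UNIQUE INDEX (word)
);

-- One row per (word, article) pair. 18M articles times a few hundred
-- distinct words each is how you reach billions of rows.
CREATE TABLE word_occurrence (
    word_id    INT NOT NULL,
    article_id INT NOT NULL,   -- reference into the article table
    INDEX (word_id, article_id)
);

-- Same two-step lookup pattern as for urls:
SELECT word_id FROM word_table WHERE word = 'mysql';
-- then, with the word_id from step 1 (7 here is a placeholder):
SELECT article_id FROM word_occurrence WHERE word_id = 7;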