Help with Database structure..
Noah, Sunday, January 20, 2002, 1:55:00 AM, you wrote:

NS> The problem is that 20,000 tables can be a bit unwieldy for mysql to
NS> handle.

I don't think MySQL itself will have much trouble supporting more than 20,000 tables, but you may run up against the inode limit of the filesystem on your OS, since each MyISAM table lives in its own set of files on disk. As Russell King correctly remarked, it's probably better to store all your 18M rows in one table.

NS> If you have any ideas, I would love to hear them.
NS> -N

-- 
Egor Egorov
[EMAIL PROTECTED]
MySQL AB / Ensita.net
www.mysql.com
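To make the single-table suggestion concrete, here is a minimal sketch of what such a table might look like, with indexes on the columns the original poster needs to select on. Every table, column, and index name here is a hypothetical illustration, not taken from the thread:

-- A hypothetical single-table layout; all names are illustrative.
CREATE TABLE articles (
    article_id   INT UNSIGNED NOT NULL AUTO_INCREMENT,
    source       VARCHAR(100) NOT NULL,
    article_date DATE         NOT NULL,
    url          VARCHAR(255) NOT NULL,
    title        VARCHAR(255) NOT NULL,
    filename     VARCHAR(255) NOT NULL,  -- where the article text lives on disk
    PRIMARY KEY (article_id),
    INDEX idx_source_date (source, article_date),
    INDEX idx_url (url(14))  -- a prefix index keeps the URL index small
);

-- Selecting by source and date hits idx_source_date directly:
SELECT filename, title
  FROM articles
 WHERE source = 'example-source'
   AND article_date = '2002-01-19';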
FW: Help with Database structure..
-----Original Message-----
From: Bret Ewin [mailto:[EMAIL PROTECTED]] On Behalf Of Bret Ewin
Sent: Monday, January 21, 2002 12:03 PM
To: 'Russell King'
Subject: RE: Help with Database structure..

You may also want to break the URL into pieces (protocol, domain, path, file) and index each piece. This would allow searches for URLs in the same domain without having to use LIKE with a leading wildcard, which cannot use an index.

Bret
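Bret's suggestion might look something like the following sketch; the exact decomposition and all names are assumptions for illustration only:

-- A hypothetical table splitting each URL into indexable pieces.
CREATE TABLE urls (
    url_id   INT UNSIGNED NOT NULL AUTO_INCREMENT,
    protocol VARCHAR(10)  NOT NULL,  -- e.g. 'http'
    domain   VARCHAR(100) NOT NULL,  -- e.g. 'www.example.com'
    path     VARCHAR(200) NOT NULL,  -- e.g. '/news/2002/'
    file     VARCHAR(100) NOT NULL,  -- e.g. 'story.html'
    PRIMARY KEY (url_id),
    INDEX idx_domain (domain)
);

-- An indexed equality test replaces scanning a single url column
-- with an unindexable LIKE '%example.com%':
SELECT url_id FROM urls WHERE domain = 'www.example.com';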
Re: Help with Database structure..
> We thought about having one massive file for all this data, BUT, we archive
> something like 100,000 articles a day. We store articles for 6 months, so
> this table would very quickly have 18,000,000 rows. Now, we need to select
> a group of articles based on source and date. We also need to select based
> on url.

Shouldn't be a problem. You can put the whole thing in one table with indexes on the various columns you'll be selecting on. In my experience it's best to tokenise the source, date and url information into separate tables, so the main table stores just the filename and three integers:

1st integer - Days since 1st Jan 1980 (for example). If you don't need to store the full datetime field, then don't bother; just store a bigint.

2nd integer - A reference to the source. You store the actual source name in another table, with this reference number against it (indexed, of course).

3rd integer - Same as #2, but a reference number for the url, with the url itself held elsewhere.

This keeps the indexes you look up on nice and small, as they store only three integer values (make sure you use the correct type of integer, BIGINT probably). If you always look up on all three values, a single index across all three is best.

To look up articles by url, first check the url_table for the matching reference number, then query the big_table for that reference number. If you have duplicate urls, store each one only once in the url_table; anything else is wasteful, and deduplicating also makes url lookups faster.

> I can only imagine how long it would take to search 18,000,000 rows for a
> specific url.

Checking the larger table with 18M rows would typically come back in well under a second on a 200MHz PC, although you need around 80M of index cache. If all the urls are unique, the url table will become pretty large, and you'll need a fair-sized index on that too; I would imagine an extra 300M would do the trick if you index only the first 14 characters or so.

These figures scale up pretty well until you run out of memory, when it all gets a bit sluggish. If that's the case, you can store the different tables on separate PCs to speed up retrieval -- after all, you aren't doing any JOINs.

Where it gets interesting is when you get a request to search all of these documents a la dejanews. It's actually not too difficult to build a very fast word-search algorithm, but you'll start to deal with properly BIG tables, with billions of rows... You also need to factor in how reliable you want the system to be, and what your acceptable down-time is.

BTW - Reloading tables of over 1M records can be a pain in the arse when you need to recover, and there's no easy way to make it quicker.

Russ.
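For readers who want to see Russ's tokenised scheme spelled out, here is a sketch under the assumptions he states (days since 1st Jan 1980, reference tables for source and url, a 14-character prefix index). The exact names and integer widths are illustrative guesses, not his:

-- Hypothetical DDL for the tokenised layout described above.
CREATE TABLE source_table (
    source_ref  INT UNSIGNED NOT NULL AUTO_INCREMENT,  -- widen to BIGINT if needed
    source_name VARCHAR(100) NOT NULL,
    PRIMARY KEY (source_ref),
    INDEX idx_source_name (source_name)
);

CREATE TABLE url_table (
    url_ref INT UNSIGNED NOT NULL AUTO_INCREMENT,
    url     VARCHAR(255) NOT NULL,
    PRIMARY KEY (url_ref),
    INDEX idx_url (url(14))  -- index only the first 14 characters, as suggested
);

CREATE TABLE big_table (
    filename   VARCHAR(255) NOT NULL,
    day_num    INT UNSIGNED NOT NULL,  -- days since 1st Jan 1980
    source_ref INT UNSIGNED NOT NULL,  -- reference into source_table
    url_ref    INT UNSIGNED NOT NULL,  -- reference into url_table
    INDEX idx_all (day_num, source_ref, url_ref)  -- one index over all three
);

-- The two-step url lookup: resolve the reference, then scan the big table.
SELECT url_ref FROM url_table WHERE url = 'http://www.example.com/story.html';
SELECT filename FROM big_table WHERE url_ref = 42;  -- 42: the value returned above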
Help with Database structure..
I need some help with a database design question.

To summarize our design (I'll spare you a lot of the details): we collect text files from 20,000 different sources every day. Each actual file is stored as a text file. Each source has its own table, which contains a list of the text files, date, title, url, etc.

We thought about having one massive file for all this data, BUT we archive something like 100,000 articles a day. We store articles for 6 months, so this table would very quickly have 18,000,000 rows. Now, we need to select a group of articles based on source and date. We also need to select based on url. I can only imagine how long it would take to search 18,000,000 rows for a specific url.

Our current concept is to have one table for each of the 20,000 sources. This way we can just query the matching table for what we need. The problem is that 20,000 tables can be a bit unwieldy for mysql to handle.

If you have any ideas, I would love to hear them.

Thanks,
-N