FW: Help with Database structure..

2002-01-21 Thread Bret Ewin



-Original Message-
From: Bret Ewin [mailto:[EMAIL PROTECTED]]On Behalf Of Bret Ewin
Sent: Monday, January 21, 2002 12:03 PM
To: 'Russell King'
Subject: RE: Help with Database structure..


You may also want to break the URL into pieces (protocol, domain, path,
file) and index each piece. This would allow searches for URLs in the same
domain without resorting to LIKE with a leading wildcard, which cannot use
an index.
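
A sketch of that idea (the table and column names are invented for
illustration, not taken from the poster's schema):

```sql
-- Hypothetical table storing each URL broken into indexed pieces.
CREATE TABLE url_parts (
    url_id   BIGINT NOT NULL PRIMARY KEY,
    protocol VARCHAR(10)  NOT NULL,   -- e.g. 'http'
    domain   VARCHAR(128) NOT NULL,   -- e.g. 'lists.mysql.com'
    path     VARCHAR(255) NOT NULL,   -- e.g. '/php/unsubscribe.php'
    file     VARCHAR(128) NOT NULL,   -- e.g. 'unsubscribe.php'
    INDEX (domain)
);

-- Same-domain search via the index, instead of LIKE '%mysql.com%':
SELECT url_id FROM url_parts WHERE domain = 'lists.mysql.com';
```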

Bret





-
Before posting, please check:
   http://www.mysql.com/manual.php   (the manual)
   http://lists.mysql.com/   (the list archive)

To request this thread, e-mail [EMAIL PROTECTED]
To unsubscribe, e-mail
[EMAIL PROTECTED]
Trouble unsubscribing? Try: http://lists.mysql.com/php/unsubscribe.php





Help with Database structure..

2002-01-21 Thread Egor Egorov

Noah,

Sunday, January 20, 2002, 1:55:00 AM, you wrote:

NS The problem is that 20,000 tables can be a  bit unwieldy for mysql to
NS handle.

I don't think it will be hard for MySQL itself to support more than 20,000
tables, but you may run up against the inode limit of the filesystem on your OS.

As Russell King correctly remarked, it's probably better to store all your
18M rows in one table.

NS If you have any ideas, I would love to hear them.

NS -N


-- 
For technical support contracts, goto https://order.mysql.com/
This email is sponsored by Ensita.net http://www.ensita.net/
   __  ___ ___   __
  /  |/  /_ __/ __/ __ \/ /Egor Egorov
 / /|_/ / // /\ \/ /_/ / /__   [EMAIL PROTECTED]
/_/  /_/\_, /___/\___\_\___/   MySQL AB / Ensita.net
   ___/   www.mysql.com







Help with Database structure..

2002-01-19 Thread Noah Silverman

I need some help with a database design question:


To summarize our design (I'll spare you a lot of the details.)

We collect text files from 20,000 different sources every day.  Each article
is stored as a text file on disk.  Each source has its own table, which
contains a list of its text files along with the date, title, url, etc.

We thought about having one massive file for all this data, BUT, we archive
something like 100,000 articles a day.  We store articles for 6 months, so
this table would very quickly have 18,000,000 rows.  Now, we need to select
a group of articles based on source and date.  We also need to select based
on url.

I can only imagine how long it would take to search 18,000,000 rows for a
specific url.


Our CURRENT concept is to have one table for each of the 20,000 sources.
This way we can just query the matching table for what we need.

The problem is that 20,000 tables can be a bit unwieldy for MySQL to
handle.


If you have any ideas, I would love to hear them.


Thanks,

-N






Re: Help with Database structure..

2002-01-19 Thread Russell King

 We thought about having one massive file for all this data, BUT, we archive
 something like 100,000 articles a day.  We store articles for 6 months, so
 this table would very quickly have 18,000,000 rows.  Now, we need to select
 a group of articles based on source and date.  We also need to select based
 on url.

Shouldn't be a problem. You can put the whole thing in one table with
indexes on the various columns you'll be selecting on.

In my experience it's best to tokenise the source, date and url information
into separate lookup tables, so the main table stores just the filename and
three integers.

1st integer - days since 1st Jan 1980 (for example). If you don't need to
store the full datetime field, then don't bother; just store the day count
as an integer.
2nd integer - a reference to the source. You can store the actual source
name in another table, keyed by this reference number (indexed, of course).
3rd integer - same as #2, but a reference number for the url, with the url
itself held in another table.

This will keep the indexes you look up on nice and small, as they store
three integer values (make sure you use the correct integer type, probably
BIGINT).

If you always lookup on all three values, an index on all three is best.
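
A rough sketch of that layout in SQL (all table and column names here are
invented for illustration):

```sql
-- Main table: filename plus three integer tokens.
CREATE TABLE big_table (
    filename  VARCHAR(255) NOT NULL,
    day_num   BIGINT NOT NULL,   -- days since 1st Jan 1980
    source_id BIGINT NOT NULL,   -- token for the source, resolved in source_table
    url_id    BIGINT NOT NULL,   -- token for the url, resolved in url_table
    INDEX (source_id, day_num, url_id)  -- one composite index, if you always
                                        -- look up on all three values
);

-- Source names held once, against their reference numbers.
CREATE TABLE source_table (
    source_id BIGINT NOT NULL PRIMARY KEY,
    name      VARCHAR(128) NOT NULL,
    INDEX (name)
);
```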

To look up articles based on url, you check the url_table for the correct
reference number, then run a query on the big_table looking for that
reference number.

If you have duplicate urls, only store them once in the url_table, as it's
just wasteful otherwise. This also makes for a faster url lookup.
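
The two-step url lookup might look like this (assuming a url_table keyed by
url_id, with each distinct url stored once; names and values are
illustrative):

```sql
-- Step 1: find the reference number for the url in the small table.
SELECT url_id FROM url_table
WHERE url = 'http://www.example.com/article.html';

-- Step 2: pull the matching articles from the big table by that number,
-- supposing step 1 returned 42:
SELECT filename FROM big_table WHERE url_id = 42;
```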

 I can only imagine how long it would take to search 18,000,000 rows for a
 specific url.

Checking the larger table with 18M rows would typically come back in well
under a second on a 200MHz PC, although you need around 80M of index cache.
If all the urls are unique, this table will become pretty large, and you'll
need a fair-sized index on that too. I would imagine 300M extra would do the
trick, if you only index the first 14 characters or so.
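
In MySQL, indexing only the first part of the url column can be done with a
prefix index (the 14-character figure is Russ's estimate; the table name is
illustrative):

```sql
-- Index just the first 14 characters of each url; rows sharing a prefix
-- are narrowed down by comparing the full column value.
CREATE INDEX url_prefix ON url_table (url(14));
```

A shorter prefix keeps the index smaller in memory, at the cost of matching
more candidate rows per lookup.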

These figures scale up pretty well, until you run out of memory, when it all
gets a bit sluggish. If that's the case, you can store the different tables
on separate PCs to speed up retrieval. After all, you aren't doing any
JOINs.

Where it gets interesting is when you get a request to search all of these
documents a la dejanews. It's actually not too difficult to build a very
fast word search algorithm, but you'll start to deal with proper BIG tables,
with billions of rows...

You also need to factor in how reliable you want the system to be, and what
your acceptable down-time is. BTW, loading tables of over 1M records can be
a pain in the arse when you need to recover, and there's no easy way to make
it quicker.

Russ.



