Re: What's the optimal db design choice for my 400 000 entries?

SGreen Wed, 08 Jun 2005 07:14:38 -0700

"Tommy Svensson \(InfoGrafix\)" <[EMAIL PROTECTED]> wrote on 06/07/2005 
04:49:09 PM:


> Hi all you mysql gurus,

> I have 400 000 unique strings where each and every one of these strings 
are
> associated with 1 - 50 (appr.) integer values.

> Now, pretty simple for you guys I guess, but how will I design my
> database to make a search interface against this data as rapid as 
possible?

> My first guess for a table whould be

> TABLE list
> string     int
> =======    =====
> string1    234
> string1    6323
> string1    343
> string2    313
> string2    9055
> ...
> string434  5445
> string434  12
> ...

> 
> But I come to a grinding halt when I realize this table will be 400 000 
rows
> big, times the sum over the number of associated integers for each 
unique
> string... let's say each unique string always have 10 associated 
integers.
> Then we have a table of 400 000 x 10 = 4 000 000 rows. Is this really 
the
> best approach?

> A search would appropriately look like this

> Search: string434

> and produce the following output

> Result: 5445 12.

> I guess the sql could look like this:

> SELECT int FROM list WHERE string = 'string434'

> What would be the optimal select query if I would like to return ints 
for
> strings that begin with say 'str'?

> And what if I wanted to search for strings where I only know the mid 
part or
> the end of the string? How then would a performance stable sql command 
look?

> If the above design suggestion is the optimal one, then I guess the 
internal
> database workings do not contain duplicates of the string values as they 
do
> in the table...? I'm kinda hoping a binary tree is built or something
> similar. Though I read somewhere that in order for the db to optimize 
and
> create btrees and the like, some data must be indexed... but how? I 
can't
> very well make the string column 'primary' since it isn't...?

> Very, very, very interested in hearing what you pros have to say! Thx a 
lot
> for this forum!

> /Tommy


You are correct in your analysis that you would have four million strings 
in your table if you only created one table. The solution to your problem 
is the design technique called "normalization" which in its simplest 
explanation is the process of preventing data duplication.

Create a table for your strings. Since they are all unique, this table 
will top out at 400000 rows.

CREATE TABLE StringMaster (
        id int unsigned auto_increment primary key
        , stringvalue varchar(255)
        , UNIQUE (stringvalue)
);

And a second table to associate the string values with their 1-50 integer 
values

CREATE TABLE StringInteger (
        StringMaster_ID int unsigned not null
        , IntegerValue int not null
        , PRIMARY KEY(StringMaster_ID, IntegerValue)
        , KEY (IntegerValue, StringMaster_ID)
)

The primary key prevents the same integer value from being assigned to the 
same string more than once (in case you don't want that restriction just 
delete the word PRIMARY from the declaration). The second key is a 
"covering key" or "covering index" to do reverse lookups (tell me all of 
the strings associated with the integer 36). The other index will nearly 
double the storage requirements for this table. The payoff is that it will 
make your lookups VERY fast and it really won't take up that much room as 
integer values only use about 4 bytes per value. 

So let's estimate disk space:

StringMaster:
(255 chars + 4 bytes) * 400000 rows = (appx) 103600000 bytes (around 99 
MB) max size

StringIntegers:
8 bytes * 400000 string values * 10 integers per string = 32000000 bytes 
(about 30 MB) data
plus an additional 30 MB or so for the extra index

Even with another index on StringMaster you are looking at less that 250MB 
total. This is trivial for MySQL to handle (even on a P-II with almost no 
memory) and you should get great results.  There are working MySQL 
databases with BILLIONS of entries containing hundreds of gigabytes of 
data and with the right equipment and design, they are performing very 
nicely.

You asked about how to write a query that searches for only part of a 
string. For that I kindly refer you to the excellent manual. It can tell 
you so much more about the LIKE function and the RLIKE function and all of 
the other things MySQL can do with strings than I have time to go over. 
(It's pretty easy once you actually try it out and get a feel for it)

http://dev.mysql.com/doc/mysql/en/string-comparison-functions.html
http://dev.mysql.com/doc/mysql/en/regexp.html

Shawn Green
Database Administrator
Unimin Corporation - Spruce Pine

Re: What's the optimal db design choice for my 400 000 entries?

Reply via email to