"Tommy Svensson \(InfoGrafix\)" <[EMAIL PROTECTED]> wrote on 06/07/2005 04:49:09 PM:
> Hi all you mysql gurus, > I have 400 000 unique strings where each and every one of these strings are > associated with 1 - 50 (appr.) integer values. > Now, pretty simple for you guys I guess, but how will I design my > database to make a search interface against this data as rapid as possible? > My first guess for a table whould be > TABLE list > string int > ======= ===== > string1 234 > string1 6323 > string1 343 > string2 313 > string2 9055 > ... > string434 5445 > string434 12 > ... > > But I come to a grinding halt when I realize this table will be 400 000 rows > big, times the sum over the number of associated integers for each unique > string... let's say each unique string always have 10 associated integers. > Then we have a table of 400 000 x 10 = 4 000 000 rows. Is this really the > best approach? > A search would appropriately look like this > Search: string434 > and produce the following output > Result: 5445 12. > I guess the sql could look like this: > SELECT int FROM list WHERE string = 'string434' > What would be the optimal select query if I would like to return ints for > strings that begin with say 'str'? > And what if I wanted to search for strings where I only know the mid part or > the end of the string? How then would a performance stable sql command look? > If the above design suggestion is the optimal one, then I guess the internal > database workings do not contain duplicates of the string values as they do > in the table...? I'm kinda hoping a binary tree is built or something > similar. Though I read somewhere that in order for the db to optimize and > create btrees and the like, some data must be indexed... but how? I can't > very well make the string column 'primary' since it isn't...? > Very, very, very interested in hearing what you pros have to say! Thx a lot > for this forum! > /Tommy You are correct in your analysis that you would have four million strings in your table if you only created one table. The solution to your problem is the design technique called "normalization" which in its simplest explanation is the process of preventing data duplication. Create a table for your strings. Since they are all unique, this table will top out at 400000 rows. CREATE TABLE StringMaster ( id int unsigned auto_increment primary key , stringvalue varchar(255) , UNIQUE (stringvalue) ); And a second table to associate the string values with their 1-50 integer values CREATE TABLE StringInteger ( StringMaster_ID int unsigned not null , IntegerValue int not null , PRIMARY KEY(StringMaster_ID, IntegerValue) , KEY (IntegerValue, StringMaster_ID) ) The primary key prevents the same integer value from being assigned to the same string more than once (in case you don't want that restriction just delete the word PRIMARY from the declaration). The second key is a "covering key" or "covering index" to do reverse lookups (tell me all of the strings associated with the integer 36). The other index will nearly double the storage requirements for this table. The payoff is that it will make your lookups VERY fast and it really won't take up that much room as integer values only use about 4 bytes per value. So let's estimate disk space: StringMaster: (255 chars + 4 bytes) * 400000 rows = (appx) 103600000 bytes (around 99 MB) max size StringIntegers: 8 bytes * 400000 string values * 10 integers per string = 32000000 bytes (about 30 MB) data plus an additional 30 MB or so for the extra index Even with another index on StringMaster you are looking at less that 250MB total. This is trivial for MySQL to handle (even on a P-II with almost no memory) and you should get great results. There are working MySQL databases with BILLIONS of entries containing hundreds of gigabytes of data and with the right equipment and design, they are performing very nicely. You asked about how to write a query that searches for only part of a string. For that I kindly refer you to the excellent manual. It can tell you so much more about the LIKE function and the RLIKE function and all of the other things MySQL can do with strings than I have time to go over. (It's pretty easy once you actually try it out and get a feel for it) http://dev.mysql.com/doc/mysql/en/string-comparison-functions.html http://dev.mysql.com/doc/mysql/en/regexp.html Shawn Green Database Administrator Unimin Corporation - Spruce Pine