Hello John, Friday, June 7, 2002, 1:54:52 AM, you wrote:
JK> Num refers to the text file from which the data is read. It contains the count of
JK> records shown in the Records column, followed by the total number read in to that
JK> point. The time in seconds to insert these records is under the Time column. And
JK> the rightmost column shows the average insertion rate. So far the speed is pretty
JK> impressive. But I have a LOT of data to index. Look at how the performance
JK> nosedives.
JK>
JK> 020  1353759    8377979   1172.17  1154.9
JK> 030  2147433   27184976   3859.09   556.5
JK> 040  2170491   49124176   6384.61   340.0
JK> 050  1964395   69808672   8363.19   234.9
JK> 060  1675533   87925330   9651.06   173.6
JK> 070  1359195  102944321  10037.75   135.4
JK> 080  1053478  114868539   9628.03   109.4
JK> 090   747348  123503148   8278.75    90.3
JK> 100   553939  129876045  13612.53    40.7
JK>
JK> At this point, with 130 million records, the size of the data table and index is
JK> about 28GB. Because I knew the files would exceed 4GB I set it up for "big tables"
JK> with an ALTER TABLE command, AVG_ROW_LENGTH = 512, and MAX_ROWS = 64x2^30. That
JK> provides ample room for growth, but maybe it negatively affects indexing speeds?

Of course 64-bit data pointers will make the index and data files a bit slower and
larger, but unfortunately you do not have much choice here.

You may try creating the table with PACK_KEYS=1, which may give you a bit of
performance (it may also produce a minor slowdown, depending on the application, so
you need to check it).

There are many issues affecting INSERT statement speed - please take a look at the
appropriate section in the MySQL manual.

JK> To populate the database I wrote a python script that reads each line of the
JK> input file and issues this SQL command:
JK>
JK> INSERT IGNORE INTO table (name, num, val) VALUES (a,b,c)

Why are you using IGNORE? Do you have duplicate rows?
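The collapse in the rightmost column can be reproduced directly from the other two:
the average insertion rate is just the chunk's record count divided by its load time.
A minimal sketch, using the first and last rows of the table quoted above:

```python
# Recompute the average insertion rate (rightmost column) from the
# Records and Time columns of the quoted table. The rate falls from
# roughly 1155 rows/sec for the first chunk to roughly 41 rows/sec
# for the last one as the indexes outgrow memory.
chunks = [
    (1353759, 1172.17),   # first chunk: records inserted, seconds taken
    (553939, 13612.53),   # last chunk
]

rates = [records / seconds for records, seconds in chunks]
for (records, seconds), rate in zip(chunks, rates):
    print(f"{records:8d} rows in {seconds:9.2f}s -> {rate:7.1f} rows/sec")
```

This confirms the slowdown is per-chunk throughput, not an artifact of cumulative
timing.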
Also, multiple-value INSERTs are recommended (they will give much faster speed). As
well, you may wish to create your table with DELAY_KEY_WRITE=1, or use LOCK
TABLES/UNLOCK TABLES around the series of inserts into this table - this will not
force MySQL to flush modified key blocks after each statement.

And the fastest bulk data load method for MySQL is "LOAD DATA". Please take a look
at whether this will work for you.

JK> And the corresponding table definition is as follows. (I've simplified this
JK> example. There are other columns defined that are not updated in this processing
JK> stage.)
JK>
JK> CREATE TABLE table
JK>   (id BIGINT UNSIGNED NOT NULL AUTO_INCREMENT PRIMARY KEY,
JK>    name VARCHAR(120) NOT NULL,
JK>    num SMALLINT NOT NULL,
JK>    val TINYINT UNSIGNED NOT NULL,
JK>    time TIMESTAMP NOT NULL,
JK>    INDEX(NAME), INDEX(num), UNIQUE(name))
JK>
JK> I'm indexing two columns in addition to the primary key, which is defined as a
JK> BIGINT. The name string averages 80 characters and is constrained to be unique.

1) You have created 2 indexes on the NAME field, which of course reduces speed.
Please drop INDEX(NAME), as you already have UNIQUE(NAME).

2) As the average length of name is more than 60% of the maximum length, it is
faster to use the CHAR() column type and static-length rows. This has moderate space
overhead but will improve speed.

JK> In contrast, the num column is densely packed -- its values range from 1-100 over
JK> 130 million rows. Given all this, where do you suppose the bottleneck lies?
JK> Clearly, the btree indexes are getting large, with many disk hits required to find
JK> a record. But more specifically, is there an alternative to the layout that could
JK> alleviate the problem?

Unfortunately MySQL does not currently support other index types for MyISAM tables.
Most of the optimizations you can do to reduce index size are mentioned above. Also,
you should consider increasing the key_buffer size so more index accesses will be
cached. What size do you currently have, and what is the total system memory size?
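The multiple-value INSERT suggestion can be sketched as a small change to the Python
loader: instead of one statement per input line, collect rows and emit one INSERT per
batch. This is only a sketch, assuming the column layout from the CREATE TABLE above;
a real loader should use its client library's own escaping rather than the naive
quoting shown here.

```python
# Sketch of a multiple-value INSERT builder, assuming the (name, num, val)
# columns from the table definition above. batch_size is a tunable guess.
def batch_insert_sql(rows, batch_size=1000):
    """Yield multi-row INSERT statements, batch_size rows per statement."""
    def quote(v):
        # Naive quoting for illustration only; prefer the client
        # library's escaping in real code.
        if isinstance(v, str):
            return "'" + v.replace("\\", "\\\\").replace("'", "\\'") + "'"
        return str(v)

    for i in range(0, len(rows), batch_size):
        batch = rows[i:i + batch_size]
        values = ",".join(
            "(%s)" % ",".join(quote(v) for v in row) for row in batch
        )
        yield "INSERT INTO table (name, num, val) VALUES %s" % values
```

Wrapping the resulting statements in LOCK TABLES ... WRITE / UNLOCK TABLES, as
suggested above, additionally avoids flushing key blocks between batches.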
Of course you will be unable to have a key_buffer which covers the whole index, but
with B-trees the most important thing is to cover the topmost levels of the tree,
which is possible. Large volumes will thus still require some physical IO.

JK> For example, not declaring UNIQUE(name) and instead placing the responsibility of
JK> the guarantee outside of MySql.

Yes. The index on the name field is the most space-consuming index, and dropping it
entirely will improve speed because of a higher key_cache hit rate. Depending on
which SELECT queries you have, a possible solution is adding a special CHECKSUM
field containing an INT/BIGINT value corresponding to some sort of checksum of the
name field. Of course you can't rely on it to be unique, but this index will be
rather fast for "ref" name lookups. Of course it will not help range queries.

JK> Or, does creating tables capable of growing past 4GB carry an inherent penalty?
JK> Or are the non-word-size integers slowing things down with the overhead of bit
JK> packing?

The big issue with your application is disk IO, so these things do not matter very
much. One possible optimization for your application may be to have a series of
tables, each one small enough for its index to fit into memory, with a MERGE table
across them. This will increase insert speed a lot but slow down SELECT statements a
bit. Also, you will have to handle auto_increment and uniqueness manually.

JK> Not having much experience with the MySql product, it's hard for me to finger the
JK> likely suspect. As another idea, would it be advisable to insert all the data into
JK> an un-indexed table (except for the primary key) and then perform one large
JK> indexing operation after all the data is loaded?

For a one-time bulk load, "LOAD DATA" is the best solution. It behaves almost the
same way - postponing index creation.

JK> I suspect you have some customers that have already encountered issues in
JK> managing large table sizes.
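The CHECKSUM-field idea above can be sketched in the Python loader. The reply only
says "some sort of checksum"; CRC32 is one assumed choice that fits an INT UNSIGNED
column, and the column name `name_crc` is hypothetical. Because collisions are
possible, a lookup must still compare the real name.

```python
# One way to implement the suggested CHECKSUM column: store CRC32(name)
# as an unsigned 32-bit INT and index that instead of the ~80-character
# name. CRC32 is an assumption here - the reply only says "some sort of
# checksum" - and it is NOT unique, so queries must re-check `name`.
import zlib

def name_checksum(name: str) -> int:
    """32-bit checksum of a name, suitable for an INT UNSIGNED column."""
    return zlib.crc32(name.encode("utf-8")) & 0xFFFFFFFF

# A "ref" lookup then becomes (with the hypothetical `name_crc` column):
#   SELECT ... FROM table WHERE name_crc = <checksum> AND name = <name>
```

The equality check on `name` filters out collisions, while the short integer index
keeps far more of the B-tree in the key_buffer than the wide name index would.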
JK> What solutions have worked in the past?

The solutions I presented are the most commonly used. Also, customers often spread
large volumes of data across a server farm.

I hope these solutions will help you. I'm sure you're going to benchmark them. Could
you please post the results to the list, so all customers can see what benefit they
may get?

--
For technical support contracts, visit https://order.mysql.com/?ref=mpza
Mr. Peter Zaitsev <[EMAIL PROTECTED]>
MySQL AB, Full-Time Developer
Moscow, Russia
www.mysql.com   M: +7 095 725 4955