Hello John,

Friday, June 7, 2002, 1:54:52 AM, you wrote:


JK> Num refers to the text file from which the data is read. It contains the count
JK> of records shown in the Records column, followed by the total number read in to
JK> that point. The time in seconds to insert these records is under the Time
JK> column. And the rightmost column shows the average insertion rate. So far the
JK> speed is pretty impressive. But I have a LOT of data to index. Look at how the
JK> performance nosedives.

JK> Num  Records      Total     Time     Rate
JK> 020  1353759    8377979  1172.17   1154.9
JK> 030  2147433   27184976  3859.09    556.5
JK> 040  2170491   49124176  6384.61    340.0
JK> 050  1964395   69808672  8363.19    234.9
JK> 060  1675533   87925330  9651.06    173.6
JK> 070  1359195  102944321 10037.75    135.4
JK> 080  1053478  114868539  9628.03    109.4
JK> 090   747348  123503148  8278.75     90.3
JK> 100   553939  129876045 13612.53     40.7

JK> At this point, with 130 million records, the size of the data table and index
JK> is about 28GB. Because I knew the files would exceed 4GB I set it up for "big
JK> tables" with an ALTER TABLE command, AVG_ROW_LENGTH = 512, and
JK> MAX_ROWS = 64x2^30. That provides ample room for growth, but maybe it
JK> negatively affects indexing speeds?

Of course 64-bit data pointers will make the index and data files a bit
slower and larger, but unfortunately you do not have much choice here.
You may try creating the table with PACK_KEYS=1, which may give you a bit
of extra performance (it may produce a minor slowdown as well, depending
on the application, so you need to check it).
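
For example, it can be switched on with a single ALTER TABLE (note that
this rebuilds the table, so on 130 million rows it will take a while):

   ALTER TABLE table PACK_KEYS=1;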

There are many issues affecting INSERT speed - please take a look at the
appropriate section in the MySQL manual.

JK> To populate the database I wrote a Python script that reads each line of the
JK> input file and issues this SQL command:

JK> INSERT IGNORE INTO table (name, num, val) VALUES (a,b,c)

Why are you using IGNORE? Do you have duplicate rows?

Also, multiple-row INSERT statements are recommended - they will give
much faster speed.
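
For example, your Python script could batch, say, 100 rows per statement
(the values below are placeholders):

   INSERT IGNORE INTO table (name, num, val) VALUES
     ('name one', 1, 10),
     ('name two', 1, 11),
     ('name three', 2, 12);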

You may also wish to create your table with DELAY_KEY_WRITE=1 or use
LOCK TABLES/UNLOCK TABLES around the series of inserts into this table -
this way MySQL is not forced to flush modified key blocks after each
statement.
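
A sketch of both (DELAY_KEY_WRITE is set once; the LOCK TABLES pattern
wraps each batch of inserts):

   ALTER TABLE table DELAY_KEY_WRITE=1;

   LOCK TABLES table WRITE;
   INSERT IGNORE INTO table (name, num, val) VALUES ('some name', 1, 10);
   -- ... the rest of the batch ...
   UNLOCK TABLES;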

And the fastest bulk data load method for MySQL is "LOAD DATA". Please
check whether it will work for you.
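
A sketch, assuming the input file can be written as tab-separated columns
(the path is illustrative):

   LOAD DATA INFILE '/tmp/input.txt' IGNORE INTO TABLE table
   FIELDS TERMINATED BY '\t'
   (name, num, val);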


JK> And the corresponding table definition is as follows. (I've simplified this
JK> example. There are other columns defined that are not updated in this
JK> processing stage.)

JK> CREATE TABLE table
JK>   (id   BIGINT UNSIGNED NOT NULL AUTO_INCREMENT PRIMARY KEY,
JK>    name VARCHAR(120) NOT NULL,
JK>    num  SMALLINT NOT NULL,
JK>    val  TINYINT UNSIGNED NOT NULL,
JK>    time TIMESTAMP NOT NULL,
JK>    INDEX(NAME), INDEX(num), UNIQUE(name))

JK> I'm indexing two columns in addition to the primary key, which is defined as
JK> a BIGINT. The name string averages 80 characters and is constrained to be
JK> unique.

1) You have created two indexes on the name column, which of course
reduces speed. Please drop INDEX(name), as you already have UNIQUE(name).
2) As the average length of name is more than 60% of the maximum length,
it is faster to use the CHAR() column type and static-length rows. This
has moderate space overhead but will improve speed. (See the sketch after
this list.)
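
Both changes in one sketch (MySQL assigns its own index names, so check
SHOW INDEX FROM table first - the redundant one may be called name or
name_2):

   ALTER TABLE table
     DROP INDEX name,
     MODIFY name CHAR(120) NOT NULL;
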
JK>  In contrast, the
JK> num column is densely packed -- its values range from 1-100 over 130 million rows.

JK> Given all this, where do you suppose the bottleneck lies? Clearly, the btree
JK> indexes are getting large, with many disk hits required to find a record.
JK> But more specifically, is there an alternative to the layout that could
JK> alleviate the problem?

Unfortunately MySQL does not currently support other index types for
MyISAM tables. Most of the optimizations you can do to reduce index size
are mentioned above. You should also consider increasing the key_buffer
size so that more index accesses are cached. Which value do you currently
have, and what is the total system memory size?

Of course you will be unable to have a key_buffer that covers the whole
index, but with B-trees the most important thing is to cover the topmost
levels of the tree, which is possible. Large volumes will thus still
require some physical IO.
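
To check what you have now and to raise it, something like the following
(512M is only a placeholder - pick a value that fits your RAM):

   SHOW VARIABLES LIKE 'key_buffer%';

and in the [mysqld] section of my.cnf:

   set-variable = key_buffer=512M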



JK> For example, not declaring UNIQUE(name) and instead placing the
JK> responsibility of the guarantee outside of MySql.

Yes. The index on the name column is the most space-consuming index, and
dropping it entirely will improve speed because of a higher key_cache hit
rate.

Depending on which SELECT queries you have, a possible solution is adding
a special checksum column which will contain an INT/BIGINT value
corresponding to some sort of checksum of the name column.

Of course you can't rely on it to be unique, but this index will be
rather fast for "ref" name lookups. It will not help range queries,
though.
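
A sketch of this approach; the column name name_sum is mine, and it uses
MD5() plus CONV() to derive a 64-bit value (an application-side checksum
would work just as well):

   ALTER TABLE table
     ADD name_sum BIGINT UNSIGNED NOT NULL,
     ADD INDEX (name_sum);

   UPDATE table SET name_sum = CONV(LEFT(MD5(name), 16), 16, 10);

   -- lookups use the narrow index first, then the exact compare:
   SELECT id FROM table
   WHERE name_sum = CONV(LEFT(MD5('some name'), 16), 16, 10)
     AND name = 'some name';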


JK> Or, does creating tables capable of growing past 4GB carry an inherent
JK> penalty? Or are the non-word-size integers slowing things down with the
JK> overhead of bit packing?

The big issue with your application is disk IO, so these details do not
matter that much.

One possible optimization for your application may be to have a series of
tables, each small enough for its index to fit into memory, with a MERGE
table across them. This will increase insert speed a lot but slow down
SELECT statements a bit. You will also have to handle AUTO_INCREMENT and
uniqueness manually.
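
A minimal sketch with two underlying tables (the names are illustrative;
each part must be an identical MyISAM table, and note there is no
AUTO_INCREMENT or UNIQUE key here - that is the manual part):

   CREATE TABLE part1
     (id   BIGINT UNSIGNED NOT NULL,
      name VARCHAR(120) NOT NULL,
      num  SMALLINT NOT NULL,
      val  TINYINT UNSIGNED NOT NULL,
      time TIMESTAMP NOT NULL,
      INDEX(name), INDEX(num)) TYPE=MyISAM;

   -- create part2, part3, ... with exactly the same definition, then:

   CREATE TABLE all_parts
     (id   BIGINT UNSIGNED NOT NULL,
      name VARCHAR(120) NOT NULL,
      num  SMALLINT NOT NULL,
      val  TINYINT UNSIGNED NOT NULL,
      time TIMESTAMP NOT NULL,
      INDEX(name), INDEX(num)) TYPE=MERGE UNION=(part1, part2);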

JK> Not having much experience with the MySql product, it's hard for me to
JK> finger the likely suspect. As another idea, would it be advisable to insert
JK> all the data into an un-indexed table (except for the primary key) and then
JK> perform one large indexing operation after all the data is loaded?

For a one-time bulk load, "LOAD DATA" is the best solution. It behaves in
almost the same way, postponing index creation.

JK> I suspect you have some customers who have already encountered issues in
JK> managing large table sizes. What solutions have worked in the past?

The solutions I presented are the most commonly used. Customers also
often spread large volumes of data across a server farm.

I hope these solutions will help you. I'm sure you're going to benchmark
them. Could you please post the results to the list so that all customers
can see what benefit they may get?





-- 
For technical support contracts, visit https://order.mysql.com/?ref=mpza
   __  ___     ___ ____  __
  /  |/  /_ __/ __/ __ \/ /    Mr. Peter Zaitsev <[EMAIL PROTECTED]>
 / /|_/ / // /\ \/ /_/ / /__   MySQL AB, Full-Time Developer
/_/  /_/\_, /___/\___\_\___/   Moscow, Russia
       <___/   www.mysql.com   M: +7 095 725 4955

