> SELECT ruid1, ruid2, overlap FROM l4_link WHERE ruid1=xxx OR ruid2=xxx ORDER BY overlap DESC
MySQL does not optimize that kind of OR well. This will run _much_ faster (with your two indexes):

    ( SELECT ruid1, ruid2, overlap FROM l4_link WHERE ruid1=xxx
      UNION
      SELECT ruid1, ruid2, overlap FROM l4_link WHERE ruid2=xxx
    ) ORDER BY overlap DESC;

Make it UNION ALL or UNION DISTINCT depending on whether xxx can appear in both fields of the same row. UNION DISTINCT makes an extra pass over the temp table to dedup.

Your version (the OR) will do a full table scan. My version will:
1. scan the ruid1 index (.MYI) (1+ disk hits for the "index range scan")
2. for each matching ruid1, fetch the data row (in .MYD) (1 hit/row); write it to a tmp table
3,4. ditto for ruid2, appending to the same tmp table
5. sort the tmp table (ORDER BY) (probably 0 disk hits, due to using the MEMORY engine)

Once the indexes are in place...
Depending on the version you are using, key_buffer_size is limited to 4G or is not. For your huge machine, 24G might be wise. The key_buffer will help steps 1 and 3. The rest of RAM will be available for the OS to cache the data blocks (steps 2 and 4).

The 16 cores -- a single query (even with the UNION I suggested) will use only one core.

How many rows (average, max) do you expect from
    SELECT ... FROM l4_link WHERE ruid1=xxx
I ask because that might run as slowly as 100 rows/sec., simply because of the disk hits.

PARTITIONing, per se, does not help performance. There are only a few use cases where PARTITION shines. (I have not seen such [yet] in your application.)
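To make that concrete, here is a minimal sketch using the covering variants suggested further down in the thread (INDEX(ruid1, ruid2, overlap) and INDEX(ruid2, ruid1, overlap)); the index names and the value 12345 are illustrative stand-ins, not anything from the original table:

    -- Build both covering indexes in one ALTER so the 640GB table is
    -- rebuilt only once. Each index covers one half of the UNION, so
    -- that SELECT can be answered from the index alone, without
    -- touching the .MYD data file.
    ALTER TABLE l4_link
        ADD INDEX ix_ruid1 (ruid1, ruid2, overlap),
        ADD INDEX ix_ruid2 (ruid2, ruid1, overlap);

    -- Check the plan; each SELECT should show type=ref on its index
    -- and "Using index" in the Extra column.
    EXPLAIN
    SELECT ruid1, ruid2, overlap FROM l4_link WHERE ruid1 = 12345
    UNION ALL
    SELECT ruid1, ruid2, overlap FROM l4_link WHERE ruid2 = 12345
    ORDER BY overlap DESC;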
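Likewise, a sketch of the key_buffer change (24G is the figure suggested above; whether a value over 4G is accepted depends on the version and on having a 64-bit build, which is the limit mentioned):

    -- Takes effect immediately for the running server. The key_buffer
    -- caches MyISAM index blocks only; data blocks rely on the OS cache.
    SET GLOBAL key_buffer_size = 24 * 1024 * 1024 * 1024;

    -- To survive a restart, also set it in my.cnf under [mysqld]:
    --   key_buffer_size = 24G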
> -----Original Message-----
> From: Christian Koetteritzsch [mailto:ckoetteritz...@e-humanities.net]
> Sent: Wednesday, June 20, 2012 1:42 PM
> To: Rick James
> Cc: Ananda Kumar; mysql@lists.mysql.com
> Subject: Re: Indexing about 40 Billion Entries
>
> Thanks for the information. It is no problem if it takes days or weeks,
> because the server is meant for exactly such tasks that take time and
> use lots of resources.
>
> On 20.06.2012 19:55, Rick James wrote:
> > Even if you get past the "REPAIR WITH KEYCACHE", the ALTER will still
> > take days, maybe weeks.
> >
> > I strongly recommend you do not try to access that many rows
> > directly. Instead, build summary tables, and access them. We can
> > discuss further.
> Did you mean that I should make several partitions with, for example,
> 1 billion entries each, and then take the union of the results from the
> partitions?
> >
> > Unfortunately, you did not even include a PRIMARY KEY when you built
> > the table. This makes any operations slow.
> There is no PRIMARY KEY because ruid1 and ruid2 are not unique values.
> >
> >> I need an index for ruid1 and for ruid2.
> > What will your SELECTs look like? You may be better off with
> > INDEX(ruid1, ruid2, overlap) and INDEX(ruid2, ruid1, overlap).
> >
> > Will you be adding more rows to this table? Or is it now static?
> No data will be added to this table, so it is a static table. And the
> indexes, as I have planned them, will be INDEX(ruid1) and INDEX(ruid2).
>
> My SELECT looks like this:
>
> SELECT ruid1, ruid2, overlap FROM l4_link WHERE ruid1=xxx OR ruid2=xxx
>
> where both xxx are the same int value.
>
> > What are the semantics of the fields?
> The semantics behind these fields: ruid1 and ruid2 are IDs of sentences
> that have words in common, and overlap is the number of words they have
> in common.
>
> I hope the information I gave you is helpful. If you have more
> questions, just ask them.
>
> Kind regards
> Christian
>
> >> -----Original Message-----
> >> From: Ananda Kumar [mailto:anan...@gmail.com]
> >> Sent: Wednesday, June 20, 2012 4:37 AM
> >> To: Christian Koetteritzsch
> >> Cc: mysql@lists.mysql.com
> >> Subject: Re: Indexing about 40 Billion Entries
> >>
> >> It looks like the value you gave for myisam_max_sort_file_size is not
> >> enough for the index creation, and hence it is doing a "REPAIR WITH
> >> KEYCACHE".
> >>
> >> Use the query below to find the minimum value required for
> >> myisam_max_sort_file_size to avoid "repair with keycache":
> >>
> >> select
> >>     a.index_name as index_name,
> >>     ceil(
> >>         (select count(*) from `db_name`.`table_name`) * (
> >>             @@global.myisam_data_pointer_size +
> >>             sum(ifnull(1 + `a`.`sub_part` *
> >>                     (`b`.`character_octet_length` / `b`.`character_maximum_length`),
> >>                 ifnull(1 + `b`.`character_octet_length`,
> >>                     case
> >>                         when `b`.`data_type` = 'tinyint'   then 1
> >>                         when `b`.`data_type` = 'smallint'  then 2
> >>                         when `b`.`data_type` = 'mediumint' then 3
> >>                         when `b`.`data_type` = 'int'       then 4
> >>                         when `b`.`data_type` = 'bigint'    then 8
> >>                         when `b`.`data_type` = 'float'     then 4
> >>                         when `b`.`data_type` = 'double'    then 8
> >>                         when `b`.`data_type` = 'real'      then 8
> >>                         when `b`.`data_type` = 'bit'       then 8
> >>                         when `b`.`data_type` = 'date'      then 3
> >>                         when `b`.`data_type` = 'datetime'  then 8
> >>                         when `b`.`data_type` = 'timestamp' then 4
> >>                         when `b`.`data_type` = 'time'      then 3
> >>                         when `b`.`data_type` = 'year'      then 1
> >>                         when `b`.`data_type` = 'enum'      then 2
> >>                         when `b`.`data_type` = 'set'       then 8
> >>                         when `b`.`data_type` = 'decimal'   then 8
> >>                     end))
> >>             + if(`a`.`nullable` = 'YES', 1, 0)
> >>             + if(`b`.`character_octet_length` >= 255, 2, 0)
> >>         ) / 1048576
> >>     ) * 1048576 as `index_size`
> >> from `information_schema`.`statistics` `a`,
> >>      `information_schema`.`columns` `b`
> >> where `a`.`table_name` = `b`.`table_name`
> >>   and `a`.`table_schema` = `b`.`table_schema`
> >>   and `a`.`column_name` = `b`.`column_name`
> >>   and `a`.`table_schema` = 'db_name'
> >>   and `a`.`table_name` = 'table_name'
> >> group by `a`.`index_name`
> >> order by `index_size` desc
> >> limit 1;
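As an illustration of how that result would be applied (the 600G figure below is a made-up example value, not the query's actual output for this table):

    -- Raise the sort-file cap above the reported index_size (a dynamic,
    -- global-only variable), and give the repair sort buffer more room
    -- than its 8M default:
    SET GLOBAL myisam_max_sort_file_size = 600 * 1024 * 1024 * 1024;
    SET GLOBAL myisam_sort_buffer_size   = 256 * 1024 * 1024;

    -- Then, from a fresh connection (so the session picks up the new
    -- sort buffer), build both indexes in a single pass:
    ALTER TABLE l4_link ADD INDEX (ruid1), ADD INDEX (ruid2);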
> >> On Wed, Jun 20, 2012 at 3:15 PM, Christian Koetteritzsch <
> >> ckoetteritz...@e-humanities.net> wrote:
> >>
> >>> Hi guys,
> >>>
> >>> As the title says, I'm trying to index 40 billion entries with two
> >>> indexes on a server with 16 cores and 128GB RAM. The table is the one
> >>> below, and it is a MyISAM table. The *.MYD file is about 640GB.
> >>>
> >>> DROP TABLE IF EXISTS `l4_link`;
> >>> CREATE TABLE `l4_link` (
> >>>   `ruid1` int NOT NULL,
> >>>   `ruid2` int NOT NULL,
> >>>   `overlap` int NOT NULL
> >>> );
> >>>
> >>> I need an index for ruid1 and for ruid2.
> >>>
> >>> The status for this table is the following:
> >>>
> >>> Name: l4_link
> >>> Engine: MyISAM
> >>> Version: 10
> >>> Row_format: Fixed
> >>> Rows: 39806500262
> >>> Avg_row_length: 17
> >>> Data_length: 676710504454
> >>> Max_data_length: 4785074604081151
> >>> Index_length: 1024
> >>> Data_free: 0
> >>> Auto_increment: NULL
> >>> Create_time: 2012-06-19 14:51:29
> >>> Update_time: 2012-06-19 16:26:35
> >>> Check_time: NULL
> >>> Collation: utf8_general_ci
> >>> Checksum: NULL
> >>> Create_options:
> >>> Comment:
> >>>
> >>> The variables for MyISAM are the following:
> >>>
> >>> mysql> show global variables like '%myisam%';
> >>> +---------------------------+----------------------+
> >>> | Variable_name             | Value                |
> >>> +---------------------------+----------------------+
> >>> | myisam_data_pointer_size  | 6                    |
> >>> | myisam_max_sort_file_size | 9223372036853727232  |
> >>> | myisam_mmap_size          | 18446744073709551615 |
> >>> | myisam_recover_options    | BACKUP               |
> >>> | myisam_repair_threads     | 1                    |
> >>> | myisam_sort_buffer_size   | 8388608              |
> >>> | myisam_stats_method       | nulls_unequal        |
> >>> | myisam_use_mmap           | OFF                  |
> >>> +---------------------------+----------------------+
> >>> 8 rows in set (0.00 sec)
> >>>
> >>> The temp folder has about 16TB of free space.
> >>>
> >>> When I start the indexing process, it copies the 640GB into a temp
> >>> file and then starts with "repair with keycache".
> >>> On the internet I found that if it says "repair with keycache" you
> >>> should increase myisam_max_sort_file_size, but this didn't work. It
> >>> still says "repair with keycache" after it has copied the data to the
> >>> temp dir.
> >>>
> >>> I hope you have an idea how to fix this.
> >>>
> >>> Thanks in advance.
> >>>
> >>> Christian

--
MySQL General Mailing List
For list archives: http://lists.mysql.com/mysql
To unsubscribe: http://lists.mysql.com/mysql