> SELECT ruid1, ruid2, overlap FROM l4_link WHERE ruid1=xxx OR ruid2=xxx ORDER 
> BY overlap DESC

MySQL does not optimize that kind of OR well.  This will run _much_ faster 
(with your two indexes):

(SELECT ruid1, ruid2, overlap FROM l4_link WHERE ruid1=xxx)
UNION
(SELECT ruid1, ruid2, overlap FROM l4_link WHERE ruid2=xxx)
ORDER BY overlap DESC;

Make it UNION ALL if xxx can never appear in both fields of one row; that 
skips the dedup pass.  Otherwise make it UNION DISTINCT, which makes an extra 
pass over the temp table to dedup the rows both SELECTs return.

Your version (OR) will do a table scan.
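
A quick way to see what the optimizer picks (a sketch; 12345 stands in for a 
real xxx value):

   EXPLAIN SELECT ruid1, ruid2, overlap
     FROM l4_link WHERE ruid1=12345 OR ruid2=12345;

If the "type" column shows ALL, it is scanning all ~40 billion rows.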

My version will:
1. scan ruid1 index (.MYI) (1+ disk hits for "index range scan")
2. for each matching ruid1, fetch the data row (in .MYD) (1 hit/row); write to 
tmp table
3,4.  ditto for ruid2, appending to same tmp table
5. sort tmp table (ORDER BY) (probably 0 disk hits, due to using MEMORY)
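
If you go with the covering indexes I suggest below (INDEX(ruid1, ruid2, 
overlap) and INDEX(ruid2, ruid1, overlap)), steps 2 and 4 vanish: the query is 
answered from the index alone (EXPLAIN shows "Using index").  A sketch of 
building both in one ALTER, so the 640GB table is rebuilt only once (index 
names are illustrative):

   ALTER TABLE l4_link
       ADD INDEX ruid1_cover (ruid1, ruid2, overlap),
       ADD INDEX ruid2_cover (ruid2, ruid1, overlap);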

Once the indexes are in place...  Depending on the version you are using, 
key_buffer_size is capped at 4G or not (64-bit builds on newer releases can go 
higher).  For your huge machine, 24G might be wise.  The key_buffer will help 
steps 1 and 3.  The rest of RAM will be available for the OS to cache the data 
blocks (steps 2 and 4).
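
A sketch for my.cnf, assuming a 64-bit build where the 4G cap does not apply:

   [mysqld]
   key_buffer_size = 24G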

The 16 cores -- A single query (even with the UNION I suggested) will use only 
one core.

How many rows (average, max) do you expect from
   SELECT ... FROM l4_link WHERE ruid1=xxx
I ask, because that might run as slow as 100 rows/sec., simply because of the 
disk hits.
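
For example (the row count is an assumption, just to show the arithmetic): if 
a typical xxx matches 50,000 rows scattered across the 640GB .MYD, steps 2 and 
4 cost about 50,000 random disk hits; at ~100 seeks/sec on a cold cache, that 
is roughly 500 seconds for one query.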

PARTITIONing, per se, does not help performance.  There are only a few use 
cases where PARTITION shines.  (I have not seen such [yet] in your application.)
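
Regarding the "REPAIR WITH KEYCACHE" below: your myisam_max_sort_file_size is 
already effectively unlimited, but myisam_sort_buffer_size is only 8M.  
Raising it before retrying the ALTER is worth a try (a sketch; the ~4G value 
is an assumption sized for your 128G box, not a tested recipe):

   SET GLOBAL myisam_sort_buffer_size = 4294967295;  -- ~4G, up from the 8M shown below

"Repair by sorting" also needs enough tmpdir room for the sort files; your 
16 TB free should cover one index build.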

> -----Original Message-----
> From: Christian Koetteritzsch [mailto:ckoetteritz...@e-humanities.net]
> Sent: Wednesday, June 20, 2012 1:42 PM
> To: Rick James
> Cc: Ananda Kumar; mysql@lists.mysql.com
> Subject: Re: Indexing about 40 Billion Entries
> 
> Thanks for the information. It is no problem if it takes days or weeks,
> because the server is meant specifically for tasks that take time and use
> lots of resources.
> 
> Am 20.06.2012 19:55, schrieb Rick James:
> > Even if you get past the "REPAIR WITH KEYCACHE", the ALTER will still
> take days, maybe weeks.
> >
> > I strongly recommend you do not try to access that many rows
> directly.  Instead, build summary tables, and access them.  We can
> discuss further.
> Did you mean that I should make several partitions with, for example, 1
> billion entries each, and then take the union of the results from the
> partitions?
> >
> > Unfortunately, you did not even include a PRIMARY KEY when you built
> the table.  This makes any operations slow.
> There is no PRIMARY KEY because ruid1 and ruid2 are not unique values.
> >
> >> I need an index for ruid1 and for ruid2.
> > What will your SELECTs look like?  You may be better off with
> > INDEX(ruid1, ruid2, overlap), and INDEX(ruid2, ruid1, overlap)
> >
> > Will you be adding more rows to this table?  Or is it now static?
> No data will be added to this table, so it is a static table.  And the
> indexes, as I have planned them, will be INDEX(ruid1) and INDEX(ruid2).
> 
> My SELECT looks like this:
> 
> SELECT ruid1, ruid2, overlap FROM l4_link WHERE ruid1=xxx OR ruid2=xxx
> 
> where both occurrences of xxx are the same int value
> 
> > What are the semantics of the fields?
> The semantics behind these fields are that ruid1 and ruid2 are IDs for
> sentences that have common words, and overlap is the number of words they
> have in common.
> 
> I hope the information I gave you is helpful. If you have more questions,
> you can just ask them.
> 
> Kind regards
> 
> Christian
> >
> >
> >> -----Original Message-----
> >> From: Ananda Kumar [mailto:anan...@gmail.com]
> >> Sent: Wednesday, June 20, 2012 4:37 AM
> >> To: Christian Koetteritzsch
> >> Cc: mysql@lists.mysql.com
> >> Subject: Re: Indexing about 40 Billion Entries
> >>
> >> It looks like the value you gave for myisam_max_sort_file_size is not
> >> enough for the index creation, and hence it is doing a "REPAIR WITH
> >> KEYCACHE".
> >>
> >> Use the query below to find the minimum value required for
> >> myisam_max_sort_file_size to avoid "repair with keycache":
> >>
> >>
> >> select
> >>     a.index_name as index_name,
> >>     ceil(
> >>         (select count(*) from `db_name`.`table_name`) * (
> >>             @@global.myisam_data_pointer_size +
> >>             sum(
> >>                 ifnull(1 + `a`.`sub_part` *
> >>                            (`b`.`character_octet_length` / `b`.`character_maximum_length`),
> >>                        ifnull(1 + `b`.`character_octet_length`,
> >>                               case
> >>                                   when `b`.`data_type` = 'tinyint'   then 1
> >>                                   when `b`.`data_type` = 'smallint'  then 2
> >>                                   when `b`.`data_type` = 'mediumint' then 3
> >>                                   when `b`.`data_type` = 'int'       then 4
> >>                                   when `b`.`data_type` = 'bigint'    then 8
> >>                                   when `b`.`data_type` = 'float'     then 4
> >>                                   when `b`.`data_type` = 'double'    then 8
> >>                                   when `b`.`data_type` = 'real'      then 8
> >>                                   when `b`.`data_type` = 'bit'       then 8
> >>                                   when `b`.`data_type` = 'date'      then 3
> >>                                   when `b`.`data_type` = 'datetime'  then 8
> >>                                   when `b`.`data_type` = 'timestamp' then 4
> >>                                   when `b`.`data_type` = 'time'      then 3
> >>                                   when `b`.`data_type` = 'year'      then 1
> >>                                   when `b`.`data_type` = 'enum'      then 2
> >>                                   when `b`.`data_type` = 'set'       then 8
> >>                                   when `b`.`data_type` = 'decimal'   then 8
> >>                               end))
> >>                 + if(`a`.`nullable` = 'YES', 1, 0)
> >>                 + if(`b`.`character_octet_length` >= 255, 2, 0)
> >>             )
> >>         ) / 1048576
> >>     ) * 1048576 as `index_size`
> >> from `information_schema`.`statistics` `a`,
> >>      `information_schema`.`columns` `b`
> >> where
> >>     `a`.`table_name` = `b`.`table_name` and
> >>     `a`.`table_schema` = `b`.`table_schema` and
> >>     `a`.`column_name` = `b`.`column_name` and
> >>     `a`.`table_schema` = 'db_name' and
> >>     `a`.`table_name` = 'table_name'
> >> group by `a`.`index_name`
> >> order by `index_size` desc
> >> limit 1;
> >>
> >> On Wed, Jun 20, 2012 at 3:15 PM, Christian Koetteritzsch
> >> <ckoetteritz...@e-humanities.net> wrote:
> >>
> >>> Hi guys,
> >>>
> >>> As the title says, I'm trying to index 40 billion entries with two
> >>> indexes on a server with 16 cores and 128GB RAM. The table is the one
> >>> below; it is a MyISAM table. The *.MYD file is about 640GB.
> >>>
> >>> DROP TABLE IF EXISTS `l4_link`;
> >>> CREATE TABLE  `l4_link` (
> >>>   `ruid1` int NOT NULL,
> >>>   `ruid2` int NOT NULL,
> >>>   `overlap` int NOT NULL
> >>> );
> >>>
> >>> I need an index for ruid1 and for ruid2.
> >>>
> >>> The status for this table is the following:
> >>>
> >>>            Name: l4_link
> >>>          Engine: MyISAM
> >>>         Version: 10
> >>>      Row_format: Fixed
> >>>            Rows: 39806500262
> >>>   Avg_row_length: 17
> >>>     Data_length: 676710504454
> >>> Max_data_length: 4785074604081151
> >>>    Index_length: 1024
> >>>       Data_free: 0
> >>>   Auto_increment: NULL
> >>>     Create_time: 2012-06-19 14:51:29
> >>>     Update_time: 2012-06-19 16:26:35
> >>>      Check_time: NULL
> >>>       Collation: utf8_general_ci
> >>>        Checksum: NULL
> >>>   Create_options:
> >>>         Comment:
> >>>
> >>> The variables for myisam are the following:
> >>> mysql> show global variables like '%myisam%';
> >>> +---------------------------+----------------------+
> >>> | Variable_name             | Value                |
> >>> +---------------------------+----------------------+
> >>> | myisam_data_pointer_size  | 6                    |
> >>> | myisam_max_sort_file_size | 9223372036853727232  |
> >>> | myisam_mmap_size          | 18446744073709551615 |
> >>> | myisam_recover_options    | BACKUP               |
> >>> | myisam_repair_threads     | 1                    |
> >>> | myisam_sort_buffer_size   | 8388608              |
> >>> | myisam_stats_method       | nulls_unequal        |
> >>> | myisam_use_mmap           | OFF                  |
> >>> +---------------------------+----------------------+
> >>> 8 rows in set (0.00 sec)
> >>>
> >>> The temp folder has about 16 TB of free space.
> >>>
> >>> When I start the indexing process, it copies the 640GB into a temp
> >>> file and then starts with "repair with keycache".
> >>> On the internet I found that if it says "repair with keycache" you
> >>> should increase "myisam_max_sort_file_size", but this didn't work.
> >>> It still says "repair with keycache" after it has copied the data to
> >>> the temp dir.
> >>>
> >>> I hope you have an idea of how to fix this.
> >>>
> >>> Thanks in advance.
> >>>
> >>> Christian
> >>>
> >>>
> 


--
MySQL General Mailing List
For list archives: http://lists.mysql.com/mysql
To unsubscribe:    http://lists.mysql.com/mysql
