Thank you very much. The first indexing process finished after about 13
hours, so I think the problem is solved now.
I set myisam_sort_buffer_size to 10GB.
For the query, I will switch to your version.
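For reference, a sketch of that setting applied at runtime (10GB expressed
in bytes; it can equally go in my.cnf as myisam_sort_buffer_size = 10G):

SET GLOBAL myisam_sort_buffer_size = 10737418240;  -- 10GB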
On 20.06.2012 23:32, Rick James wrote:
SELECT ruid1, ruid2, overlap FROM l4_link WHERE ruid1=xxx OR ruid2=xxx ORDER BY
overlap DESC
MySQL does not optimize that kind of OR well. This will run _much_ faster
(with your two indexes):
(SELECT ruid1, ruid2, overlap FROM l4_link WHERE ruid1=xxx)
UNION
(SELECT ruid1, ruid2, overlap FROM l4_link WHERE ruid2=xxx)
ORDER BY overlap DESC;
Make it UNION ALL or UNION DISTINCT depending on whether xxx can be in both
fields of one row: a row with ruid1=xxx AND ruid2=xxx is returned by both
branches, and UNION DISTINCT makes an extra pass over the temp table to
dedup, while UNION ALL keeps both copies.
Your version (OR) will do a full table scan.
My version will:
1. scan the ruid1 index (.MYI) (1+ disk hits for the "index range scan")
2. for each matching ruid1, fetch the data row (from .MYD) (1 hit/row);
write it to a tmp table
3-4. repeat steps 1-2 for ruid2, appending to the same tmp table
5. sort the tmp table (ORDER BY) (probably 0 disk hits, since it uses the
MEMORY engine)
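A quick way to see the difference once both indexes exist; 1234 is only a
placeholder id, and exact plan output varies by version:

-- OR form: typically planned as a full table scan (type=ALL) here.
EXPLAIN SELECT ruid1, ruid2, overlap FROM l4_link
WHERE ruid1=1234 OR ruid2=1234;

-- UNION form: each branch should show type=ref on its own index.
EXPLAIN SELECT ruid1, ruid2, overlap FROM l4_link WHERE ruid1=1234
UNION
SELECT ruid1, ruid2, overlap FROM l4_link WHERE ruid2=1234
ORDER BY overlap DESC;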
Once the indexes are in place... Depending on the version you are using,
key_buffer_size is limited to 4G or is not. For your huge machine, 24G might
be wise. The key_buffer will help steps 1 and 3. The rest of RAM will be
available for the OS to cache the data blocks (steps 2 and 4).
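A sketch of how that could be set (the 24G figure is the suggestion above;
on builds where key_buffer_size is capped at 4G, larger values are clamped):

# in my.cnf (takes effect on restart); can also be set at runtime with
# SET GLOBAL key_buffer_size = 25769803776;  -- 24G
[mysqld]
key_buffer_size = 24G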
As for the 16 cores: a single query (even with the UNION I suggested) will
use only one core.
How many rows (average, max) do you expect from
SELECT ... FROM l4_link WHERE ruid1=xxx
I ask because that might run as slow as 100 rows/sec, simply because of the
disk hits.
I'm expecting an average of 5000 rows.
PARTITIONing, per se, does not help performance. There are only a few use
cases where PARTITION shines. (I have not seen such a case [yet] in your
application.)
-----Original Message-----
From: Christian Koetteritzsch [mailto:ckoetteritz...@e-humanities.net]
Sent: Wednesday, June 20, 2012 1:42 PM
To: Rick James
Cc: Ananda Kumar; mysql@lists.mysql.com
Subject: Re: Indexing about 40 Billion Entries
Thanks for the information. It's no problem if it takes days or weeks; the
server is dedicated to exactly this kind of long-running, resource-hungry
task.
On 20.06.2012 19:55, Rick James wrote:
Even if you get past the "REPAIR WITH KEYCACHE", the ALTER will still
take days, maybe weeks.
I strongly recommend you do not try to access that many rows
directly. Instead, build summary tables, and access them. We can
discuss further.
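To make the summary-table idea concrete, a minimal sketch under assumed
names (l4_link_summary and its columns are hypothetical, not from this
thread):

-- Hypothetical: per-sentence aggregates, so common lookups touch a small
-- table instead of the 40-billion-row link table.
CREATE TABLE l4_link_summary (
  ruid        INT NOT NULL PRIMARY KEY,
  link_count  INT NOT NULL,
  max_overlap INT NOT NULL
) ENGINE=MyISAM;

-- Populate from the ruid1 side; a second pass (or a UNION ALL of both
-- sides before grouping) would fold in the ruid2 side as well.
INSERT INTO l4_link_summary
SELECT ruid1, COUNT(*), MAX(overlap)
FROM l4_link
GROUP BY ruid1;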
Did you mean that I should make several partitions with, for example,
1 billion entries each, and then take a union of the results from the
partitions?
Unfortunately, you did not even include a PRIMARY KEY when you built
the table. This makes any operations slow.
There is no PRIMARY KEY because ruid1 and ruid2 are not unique values.
I need an index for ruid1 and for ruid2.
What will your SELECTs look like? You may be better off with
INDEX(ruid1, ruid2, overlap), and INDEX(ruid2, ruid1, overlap)
Will you be adding more rows to this table? Or is it now static?
No data will be added to this table, so it is a static table. The indexes,
as I have planned them, will be INDEX(ruid1) and INDEX(ruid2).
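Spelled out (the index names here are made up), the planned indexes, plus
the covering variants Rick suggested, would be:

-- Planned: two single-column indexes.
ALTER TABLE l4_link
  ADD INDEX idx_ruid1 (ruid1),
  ADD INDEX idx_ruid2 (ruid2);

-- Covering alternative: each index also carries the other two columns,
-- so the SELECT below can be answered from the index alone (no .MYD hits):
-- ALTER TABLE l4_link
--   ADD INDEX idx_ruid1_cov (ruid1, ruid2, overlap),
--   ADD INDEX idx_ruid2_cov (ruid2, ruid1, overlap);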
My SELECT looks like this:
SELECT ruid1, ruid2, overlap FROM l4_link WHERE ruid1=xxx OR ruid2=xxx
and both occurrences of xxx are the same int value
What are the semantics of the fields?
The semantics behind these fields: ruid1 and ruid2 are IDs of sentences
that have words in common, and overlap is the number of words they share.
I hope the information I gave you is helpful. If you have more questions,
feel free to ask.
Kind regards
Christian
-----Original Message-----
From: Ananda Kumar [mailto:anan...@gmail.com]
Sent: Wednesday, June 20, 2012 4:37 AM
To: Christian Koetteritzsch
Cc: mysql@lists.mysql.com
Subject: Re: Indexing about 40 Billion Entries
It looks like the value you gave for myisam_max_sort_file_size is not
enough for the index creation, hence it is doing a "REPAIR WITH KEYCACHE".
Use the query below to compute the minimum value required for
myisam_max_sort_file_size to avoid "repair with keycache":
select
    a.index_name as index_name,
    ceil(
        (select count(*) from `db_name`.`table_name`) *
        (
            @@global.myisam_data_pointer_size +
            sum(
                ifnull(1 + `a`.`sub_part` *
                           (`b`.`character_octet_length` / `b`.`character_maximum_length`),
                    ifnull(1 + `b`.`character_octet_length`,
                        case
                            when `b`.`data_type` = 'tinyint'   then 1
                            when `b`.`data_type` = 'smallint'  then 2
                            when `b`.`data_type` = 'mediumint' then 3
                            when `b`.`data_type` = 'int'       then 4
                            when `b`.`data_type` = 'bigint'    then 8
                            when `b`.`data_type` = 'float'     then 4
                            when `b`.`data_type` = 'double'    then 8
                            when `b`.`data_type` = 'real'      then 8
                            when `b`.`data_type` = 'bit'       then 8
                            when `b`.`data_type` = 'date'      then 3
                            when `b`.`data_type` = 'datetime'  then 8
                            when `b`.`data_type` = 'timestamp' then 4
                            when `b`.`data_type` = 'time'      then 3
                            when `b`.`data_type` = 'year'      then 1
                            when `b`.`data_type` = 'enum'      then 2
                            when `b`.`data_type` = 'set'       then 8
                            when `b`.`data_type` = 'decimal'   then 8
                        end))
                + if(`a`.`nullable` = 'YES', 1, 0)
                + if(`b`.`character_octet_length` >= 255, 2, 0)
            )
        ) / 1048576
    ) * 1048576 as `index_size`
from `information_schema`.`statistics` `a`,
     `information_schema`.`columns` `b`
where `a`.`table_name` = `b`.`table_name`
  and `a`.`table_schema` = `b`.`table_schema`
  and `a`.`column_name` = `b`.`column_name`
  and `a`.`table_schema` = 'db_name'
  and `a`.`table_name` = 'table_name'
group by `a`.`index_name`
order by `index_size` desc
limit 1;
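Once the query reports the largest index size, applying it might look like
this; the 700G figure is only a placeholder for whatever index_size comes
back:

-- Placeholder: round the reported index_size up generously.
SET GLOBAL myisam_max_sort_file_size = 751619276800;  -- 700G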
On Wed, Jun 20, 2012 at 3:15 PM, Christian Koetteritzsch
<ckoetteritz...@e-humanities.net> wrote:
Hi guys,
As the title says, I'm trying to index 40 billion entries with two indexes
on a server with 16 cores and 128GB RAM. The table is the MyISAM table
below; the *.MYD file is about 640GB.
DROP TABLE IF EXISTS `l4_link`;
CREATE TABLE `l4_link` (
`ruid1` int NOT NULL,
`ruid2` int NOT NULL,
`overlap` int NOT NULL
);
I need an index for ruid1 and for ruid2.
The status for this table is the following:
Name: l4_link
Engine: MyISAM
Version: 10
Row_format: Fixed
Rows: 39806500262
Avg_row_length: 17
Data_length: 676710504454
Max_data_length: 4785074604081151
Index_length: 1024
Data_free: 0
Auto_increment: NULL
Create_time: 2012-06-19 14:51:29
Update_time: 2012-06-19 16:26:35
Check_time: NULL
Collation: utf8_general_ci
Checksum: NULL
Create_options:
Comment:
The variables for myisam are the following:
mysql> show global variables like '%myisam%';
+---------------------------+----------------------+
| Variable_name             | Value                |
+---------------------------+----------------------+
| myisam_data_pointer_size  | 6                    |
| myisam_max_sort_file_size | 9223372036853727232  |
| myisam_mmap_size          | 18446744073709551615 |
| myisam_recover_options    | BACKUP               |
| myisam_repair_threads     | 1                    |
| myisam_sort_buffer_size   | 8388608              |
| myisam_stats_method       | nulls_unequal        |
| myisam_use_mmap           | OFF                  |
+---------------------------+----------------------+
8 rows in set (0.00 sec)
The temp folder has about 16TB of free space.
When I start the indexing process, it copies the 640GB into a temp file and
then starts with "repair with keycache".
On the internet I found that if it says "repair with keycache" you should
increase myisam_max_sort_file_size, but this didn't work. It still says
"repair with keycache" after it has copied the data to the temp dir.
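For reference, the repair strategy MySQL picked is visible as the thread
state while the build runs:

-- Watch the State column during the ALTER/REPAIR:
SHOW PROCESSLIST;
-- "Repair by sorting"    -> the fast filesort path (what you want)
-- "Repair with keycache" -> MySQL judged a sort limit insufficient
--    (myisam_max_sort_file_size, myisam_sort_buffer_size, or tmpdir space)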
I hope you have an idea of how to fix this.
Thanks in advance.
Christian