RE: Re: Why is creating indexes faster after inserting massive data rows?
A BTree that is small enough to be cached in RAM can be quickly maintained. Even the “block splits” are not too costly without the I/O. A big file that needs sorting – bigger than can be cached in RAM – is more efficiently done with a dedicated “sort merge” program. A “big” INDEX on a table may be big enough to fall into this category.

I/O is the most costly part of any of these operations. My rule of thumb for MySQL SQL statements is: if everything is cached, the query will run ten times as fast as it would if things have to be fetched from disk.

Sortmerge works this way:
1. Sort as much of the file as you can in RAM. Write that sorted piece to disk.
2. Repeat for the next chunk of the file. Repeat until the input file is broken into sorted chunks.
3. Now, “merge” those chunks together – take the first row from each, decide which is the “smallest”, send it to the output.
4. Repeat until finished with all the pieces.
For a really big task, there may have to be more than one “merge” pass.

Note how sort merge reads the input sequentially once, writes the output sequentially once, and has sequential I/O for each merge chunk. “Sequential” I/O is faster than “random” I/O – no arm motion on traditional disks. (SSDs are a different matter; I won’t go into that.)

The “output” from the sortmerge is fed into code that builds the BTree for the table. This building of the BTree is sequential – fill the first block, move on to the next block, and never have to go back.

BTrees (when built randomly), if they need to spill to disk, will involve random I/O. (And we are talking about an INDEX that is so big that it needs to spill to disk.) When a block “splits”, one full block becomes two half-full blocks. Randomly filling a BTree leads to, on average, the index being 69% full. This is not a big factor in the overall issue, but perhaps worth noting.

How bad can it get? Here’s an example.
* You have an INDEX on some random value, such as a GUID or MD5.
* The INDEX will be 5 times as big as you can fit in RAM.
* MySQL is adding to the BTree one row at a time (the non-sortmerge way).
When it is nearly finished, only 1 of 5 updates to the BTree can be done immediately in RAM; 4 out of 5 updates to the BTree will have to hit disk. If you are using normal disks, that is on the order of 125 rows per second that you can insert – terrible! Sortmerge is likely to average over 10,000.

From: Zhangzhigang [mailto:zzgang_2...@yahoo.com.cn]
Sent: Tuesday, May 08, 2012 9:13 PM
To: Rick James
Cc: mysql@lists.mysql.com
Subject: Re: Why is creating indexes faster after inserting massive data rows?

James...
* By doing all the indexes after building the table (or at least all the non-UNIQUE indexes), sort merge can be used. This technique has been highly optimized over the past half-century, and is more efficient.
I have a question about sort merge: why does it do the whole sort merge? In my opinion, it could just maintain the B-tree and insert each key into a B-tree node that holds few sorted keys, which would give good performance. If it only does the sort merge, the B-tree data structure still has to be created separately; doesn't that waste some performance?
From: Rick James <rja...@yahoo-inc.com>
To: Johan De Meersman <vegiv...@tuxera.be>; Zhangzhigang <zzgang_2...@yahoo.com.cn>
Cc: mysql@lists.mysql.com
Date: Tuesday, May 8, 2012, 12:35 AM
Subject: RE: Why is creating indexes faster after inserting massive data rows?

* Batch INSERTs run faster than one-row-at-a-time, but this is unrelated to INDEX updating speed.
* The cache size is quite important to dealing with indexing during INSERT; see http://mysql.rjweb.org/doc.php/memory
* Note that mysqldump sets up for an efficient creation of indexes after loading the data. This is not practical (or necessarily efficient) when incrementally INSERTing into a table.

As for the original question...
* Updating the index(es) for one row often involves random BTree traversals. When the index(es) are too big to be cached, this can involve disk hit(s) for each row inserted.
* By doing all the indexes after building the table (or at least all the non-UNIQUE indexes), sort merge can be used. This technique has been highly optimized over the past half-century, and is more efficient.

-----Original Message-----
From: Johan De Meersman [mailto:vegiv...@tuxera.be]
Sent: Monday, May 07, 2012 1:29 AM
To: Zhangzhigang
Cc: mysql@lists.mysql.com
Subject: Re: Why is creating indexes faster after inserting massive data rows?

- Original Message -
From: Zhangzhigang <zzgang_2...@yahoo.com.cn>
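[Editor's note: Rick's four numbered steps are a textbook external merge sort. For readers who want to see the mechanics, here is a minimal Python sketch of the idea. It is illustrative only, not MySQL's actual code; the chunk_size parameter and the newline-terminated input are assumptions of this sketch.]

    import heapq
    import tempfile

    def _spill(sorted_chunk):
        """Steps 1-2: write one sorted run to disk, sequentially."""
        f = tempfile.NamedTemporaryFile("w", delete=False, suffix=".run")
        f.writelines(sorted_chunk)          # assumes each line ends in '\n'
        f.close()
        return f.name

    def external_sort(lines, chunk_size=100_000):
        """Sort an arbitrarily large iterable of text lines in bounded RAM."""
        runs, chunk = [], []
        for line in lines:                  # read the input sequentially, once
            chunk.append(line)
            if len(chunk) >= chunk_size:    # "as much as you can in RAM"
                runs.append(_spill(sorted(chunk)))
                chunk = []
        if chunk:
            runs.append(_spill(sorted(chunk)))

        # Steps 3-4: merge the runs. heapq.merge repeatedly emits the
        # smallest head-of-run, so every run is read sequentially, once.
        files = [open(name) for name in runs]
        try:
            yield from heapq.merge(*files)
        finally:
            for f in files:
                f.close()

Feeding the sorted stream into an index builder then only ever appends to the current rightmost leaf block: the sequential, fill-and-move-on build Rick describes.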
Re: Re: Why is creating indexes faster after inserting massive data rows?
This thread is going on and on and on and on, does anyone have time to actually measure I/O? Let's make numbers talk.

Claudio
RE: Re: Why is creating indexes faster after inserting massive data rows?
Disagree all the way: numbers are numbers, and better than words, always.

Claudio

On May 9, 2012 7:22 PM, Rick James <rja...@yahoo-inc.com> wrote:
Numbers can be misleading – one benchmark will show no difference; another will show 10x difference. Recommend you benchmark _*your*_ case.
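[Editor's note: in that spirit, here is a tiny harness you can adapt to get your own numbers. It uses Python's built-in sqlite3 as a stand-in for any B-tree-backed engine; that substitution is an assumption, SQLite is not MySQL, and an in-memory database understates the I/O effect, so point it at a file and raise ROWS to see the gap widen.]

    import random
    import sqlite3
    import string
    import time

    ROWS = 200_000   # raise this until the index outgrows the cache

    def random_key():
        # random keys, like GUIDs/MD5s, are the worst case for a BTree
        return "".join(random.choices(string.ascii_lowercase, k=32))

    def bench(index_first):
        db = sqlite3.connect(":memory:")   # use a file path for real disk I/O
        db.execute("CREATE TABLE t (k TEXT)")
        if index_first:
            db.execute("CREATE INDEX t_k ON t(k)")
        t0 = time.perf_counter()
        db.executemany("INSERT INTO t VALUES (?)",
                       ((random_key(),) for _ in range(ROWS)))
        if not index_first:
            db.execute("CREATE INDEX t_k ON t(k)")
        db.commit()
        db.close()
        return time.perf_counter() - t0

    print("index before inserts:", bench(True))
    print("index after inserts :", bench(False))

Expect the gap to grow sharply once the index no longer fits in cache, which is exactly Rick's point about benchmarking your own case.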
Re: Re: Why is creating indexes faster after inserting massive data rows?
On 2012/05/07 10:53, Zhangzhigang wrote:

johan: "Plain and simple: the indices get updated after every insert statement, whereas if you only create the index *after* the inserts, the index gets created in a single operation, which is a lot more efficient."

OK, creating the index *after* the inserts, the index gets created in a single operation. But the index still has to be updated row by row after all the data rows have been inserted. Does it work that way? So I cannot see the difference in overhead between the two approaches.

My simplified 2c. When inserting rows one by one with active indexes, MySQL has to:
1) lock the space for the data to be added,
2) write the data,
3) lock the index,
4) write the index key(s),
5) unlock the index,
6) unlock the data.

This happens for each row. When first loading all the data without an index, only 1, 2, and 6 happen. When you then create an index, it can lock the index, read all the data, write all index keys in one go, and then unlock the index.

If you make an omelet, do you fetch your eggs from the fridge one by one, or all at the same time? :)

HTH,
Alex
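[Editor's note: Alex's six steps, turned into a toy cost model. This is purely illustrative, it counts bookkeeping operations rather than real I/O, and real engines batch and cache far more cleverly; the raw counts end up close, and the real win is that the one-pass build's index work is sequential rather than random.]

    def insert_with_index(rows):
        # Alex's steps 1-6 run once per every row.
        per_row = 6   # lock data, write data, lock index,
                      # write key, unlock index, unlock data
        return rows * per_row

    def load_then_build_index(rows):
        # Only steps 1, 2 and 6 run per row while loading...
        load_ops = rows * 3
        # ...then one lock / full scan / key writes / unlock pass.
        build_ops = 1 + rows + rows + 1
        return load_ops + build_ops

    n = 1_000_000
    print(insert_with_index(n))       # 6,000,000 interleaved ops
    print(load_then_build_index(n))   # 5,000,002, done as sequential passes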
Re: Re: Why is creating indexes faster after inserting massive data rows?
Creating the index in one go is one macro-sort operation; updating the index at every row is doing that operation over and over again. If you do not understand the difference, I recommend you read some basics about sorting algorithms, a very interesting read anyway.

Claudio

2012/5/7 Zhangzhigang <zzgang_2...@yahoo.com.cn>:

From: Johan De Meersman <vegiv...@tuxera.be>
To: Zhangzhigang <zzgang_2...@yahoo.com.cn>
Cc: mysql@lists.mysql.com
Date: Monday, May 7, 2012, 4:28 PM
Subject: Re: Why is creating indexes faster after inserting massive data rows?

- Original Message -
From: Zhangzhigang <zzgang_2...@yahoo.com.cn>

Creating indexes after inserting massive data rows is faster than before inserting data rows. Please tell me why.

Plain and simple: the indices get updated after every insert statement, whereas if you only create the index *after* the inserts, the index gets created in a single operation, which is a lot more efficient.

I seem to recall that inside of a transaction (thus, InnoDB or so) the difference is markedly less; I might be wrong, though.

--
Claudio
Re: Re: Why is creating indexes faster after inserting massive data rows?
- Original Message -
From: Zhangzhigang <zzgang_2...@yahoo.com.cn>

OK, creating the index *after* the inserts, the index gets created in a single operation. But the index still has to be updated row by row after all the data rows have been inserted. Does it work that way?

No. When you create an index on an existing table (like after a mass insert), what happens is that the engine does a single full tablescan and builds the index in a single pass, which is a lot more performant than updating a single disk block for every record, for the simple reason that a single disk block can contain dozens of index entries.

Imagine that you insert one million rows, and you have 100 index entries in a disk block (random numbers, to make a point; real numbers will depend on storage, file system, index, et cetera). Obviously there's no way to write less than a single block to disk - that's how it works.

You can update your index for each record in turn. That means you will need to do 1 million index - and thus block - writes, plus additional reads for those blocks you don't have in memory - that's the index cache.

Now, if you create a new index on an existing table, you are first of all bypassing any index read operations - there *is* no index to read, yet. Then the system is going to do a full tablescan - considered slow, but you need all the data, so there's no better way anyway. The index will be built - in-memory as much as possible - and the system will automatically prefer to write only complete blocks - 10,000 of them. That's the exact same number of index blocks, but you only write each block once, so that's only 10,000 writes instead of 1,000,000.

Now there's a lot more at play, things like B-tree balancing and whatnot, but that's the basic picture.

--
Bier met grenadyn
Is als mosterd by den wyn
Sy die't drinkt, is eene kwezel
Hy die't drinkt, is ras een ezel
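[Editor's note: Johan's block arithmetic, written out. His 100-entries-per-block figure is explicitly a made-up round number, so treat the result as an order-of-magnitude illustration.]

    rows = 1_000_000
    entries_per_block = 100                    # Johan's illustrative figure

    index_blocks = rows // entries_per_block   # 10,000 blocks in the final index

    # Incremental updates: in the worst case every inserted row dirties one
    # index block, so up to one block write (plus reads for any uncached
    # block) per row.
    incremental_block_writes = rows            # up to 1,000,000

    # Build-after-load: each block is filled completely and written once.
    bulk_block_writes = index_blocks           # 10,000

    print(incremental_block_writes // bulk_block_writes)   # 100x fewer writes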
RE: Re: Why is creating indexes faster after inserting massive data rows?
As a side note, TokuDB uses what it calls fractal technology to somewhat improve the performance of incremental INDEXing. They delay some of the BTree work so that they can better batch stuff. While waiting for that to finish, queries are smart enough to look in more than one place for the index info. InnoDB does something similar, but it is limited to the size of the buffer_pool.
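[Editor's note: the buffering idea Rick sketches can be shown in miniature. The toy class below is an illustration only; TokuDB's fractal trees and InnoDB's change buffer are far more sophisticated, and every name here is made up. It batches incoming keys in a cheap buffer and merges them into the "real" index in bulk, while lookups check both places.]

    import bisect
    import heapq

    class BufferedIndex:
        """Toy write-buffered index: a sorted list stands in for the BTree."""

        def __init__(self, flush_threshold=1000):
            self.tree = []        # the "BTree": kept sorted
            self.pending = []     # cheap append-only buffer of new keys
            self.flush_threshold = flush_threshold

        def insert(self, key):
            self.pending.append(key)            # no random update yet
            if len(self.pending) >= self.flush_threshold:
                self._flush()

        def _flush(self):
            # One sorted, sequential merge amortizes many random updates.
            self.pending.sort()
            self.tree = list(heapq.merge(self.tree, self.pending))
            self.pending.clear()

        def __contains__(self, key):
            # Queries must look in more than one place, as Rick notes.
            i = bisect.bisect_left(self.tree, key)
            if i < len(self.tree) and self.tree[i] == key:
                return True
            return key in self.pending

The trade-off is exactly the one Rick describes: cheaper, batched writes bought with slightly more work per lookup.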
Re: Re: Why is creating indexes faster after inserting massive data rows?
Hi,

A couple cents to this. There isn't really a million block writes. The record gets added to the block, but that block gets modified in the OS cache if we assume MyISAM tables, and in the InnoDB buffer if we assume InnoDB tables. In both cases the actual writing does not take place immediately and does not slow down the process. What does happen for each operation, however, is processing the statement, locating the entries to update in the index, index block splits and, for good reason, committing.

When it comes to creating an index, what needs to happen is to read the whole table and to sort all rows by the index key. The latter process will be the most determining factor in answering the original question, because for large tables the sort will have to do a lot of disk I/O. The point I am trying to make is that there will be situations when creating indexes first and then inserting the rows will be faster than creating an index afterwards. If we try to determine such situations, we notice that the likelihood of the sort going to disk increases with the number of distinct values to be sorted. For this reason, my choice would be to create things like primary/unique keys beforehand unless I am certain that everything will fit in the available memory.

Peace
Karen
Re: Re: Why is creating indexes faster after inserting massive data rows?
Good point about the key buffer. I was only thinking about the table updates for MyISAM, not indexes. Being stuck waiting for a buffer flush could also happen. However, for the table blocks this would be the same issue as with a load followed by an index rebuild, and for the indexes it would have to be compared, performance-wise, with the expense of sorting an equally sized index.

On May 7, 2012, at 10:40 AM, Rick James wrote:

(Correction to Karen's comments)
* MyISAM does all its index operations in the key_buffer, similar to InnoDB and its buffer_pool.
* Yes, writes are delayed (in both engines), but not forever. If the table is huge, you will eventually be stuck waiting for blocks to be flushed from cache.
* If the table is small enough, all the I/O can be delayed, and done only once. So yes, the in-memory cache may be faster.

Based on this discussion, you should note that random indexes, such as GUIDs, MD5s, etc, tend to