Hello Machiel,

On 12/6/2011 01:40, Machiel Richards - Gmail wrote:
Good day all

Someone has asked me the following. Since I do not have many years of
experience with volumes of this type, I am posting it here, as I know
someone will probably be able to answer it better than I can.

(This will also give me a learning opportunity to see what to do)

Client Question:

Well, let me describe the issue.

1. I need to load records into a MySQL database table -- no problem so
far ;-)

2. The table represents "stock" that will be searched and transacted
(i.e. sold, which involves changing flags on the record) by a live
system.

3. The stock table will be big -- millions or tens of millions of rows.

4. Stock is uniquely identified by two fields -- a supplier ID (numeric)
and a serial number (varchar).

5. Transaction volumes may also be very high.

6. Stock must be available to the system 24/7.

7. I will need to replenish the stock table from a file, one or more
times a day -- potentially loading tens or hundreds of thousands of rows
each time.

8. The DB will be master-slave: reporting and recon files off the
slave, transactions off the master (and presumably replenishment into
the master).

I can go into a lot more detail about the process I am following (using
an ETL tool called Talend) ... but the essential question is about
strategies for doing this kind of dynamic loading:

1. How to insert data (high volumes) into the live table without locking
it and affecting transaction performance (INSERT LOW_PRIORITY?)

2. How to speed up inserts, even with a two-field unique key constraint.
My observation is the obvious one -- that inserts get slower and slower
as the table grows (date-based partitions of some kind, maybe?).

3. General principles/strategies for dealing with situations like this.



Can someone please assist?


I can't give you precise details but I can point you in the right directions. Your requirements are well-formed but they tend to contradict each other. While there is no way to remove the contradictions completely, there are ways to minimize their impact.

#5 High transaction volumes
#6 Available 24x7
#1,#7 Bulk updates of 10000+ records daily

These three are in conflict. Database changes require index maintenance, which can be fast (for small changes or small indexes) or take a noticeable length of time (larger changes, larger indexes, or both). This means you may need two systems that you flip-flop into place to minimize your downtime; graphics card manufacturers solved the same problem by creating multiple frame buffers. You can make your 'unavailability' window as short as possible by updating a passive copy of the data while your application front-end is not pointed at it, then swapping the 'updated' set of data for the 'old' set, either by altering the virtual IPs of your sets of instances or by redirecting which set your applications pull data from.
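The same flip-flop can also be done inside a single instance at the table level, because RENAME TABLE swaps several tables in one atomic step. A minimal sketch, assuming a staging table and an illustrative file path (both hypothetical), and glossing over any flag changes that land on 'stock' while the copy is being built -- those would need to be paused or replayed:

    -- Build the 'passive' copy offline
    CREATE TABLE stock_new LIKE stock;
    INSERT INTO stock_new SELECT * FROM stock;

    -- Apply the replenishment file to the passive copy
    LOAD DATA INFILE '/data/replenish.csv'
        INTO TABLE stock_new
        FIELDS TERMINATED BY ',' LINES TERMINATED BY '\n';

    -- Swap atomically: readers see either the old table or the new
    -- one, never an in-between state
    RENAME TABLE stock TO stock_old, stock_new TO stock;
    DROP TABLE stock_old;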

#8 System will be master-slave
My flip-flop idea implies that your system will have two sets of master-slave(s): one carrying the 'current' data and one used to build the 'new' set of data (with the imports). This also implies that your 'active' set will need to replicate to your 'passive' set to keep it in sync between bulk updates.
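Keeping the passive set in sync is ordinary MySQL replication. A minimal sketch run on the passive set's master -- the host name, credentials, and binlog coordinates below are placeholders, not real values:

    -- Point the passive master at the active master
    CHANGE MASTER TO
        MASTER_HOST = 'active-master.example.com',
        MASTER_USER = 'repl',
        MASTER_PASSWORD = '********',
        MASTER_LOG_FILE = 'mysql-bin.000001',
        MASTER_LOG_POS = 4;
    START SLAVE;

    -- Pause replication while the bulk import rebuilds this set
    STOP SLAVE;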

#2a Many records need to change in a day
#3 There will be millions of records
#2b Searches need to be fast

These conflict with each other, too. The more records you add to a table, the longer any indexed lookup will take. If the index data is not already in memory, a trip to disk is necessary to retrieve the columns for your query. Combine that with the number of concurrent queries, divide by the maximum number of random-access reads a physical disk can sustain, and you may easily exceed the capacity of any single disk storage system. This implies that you need to look at dividing your storage among several independent devices. Options abound: sharding, partitioning, or simple configuration changes (some tablespaces on one device, some on others). Or you can price solid-state disks for your storage needs. Factoring in requirement #4, a partitioning scheme based on (supplier, serial#) may be a good first design choice.
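A rough way to gauge how close you are to that point is to compare the table's data and index footprint against the InnoDB buffer pool; once the hot index pages no longer fit, lookups start paying for disk seeks. A sketch -- the schema and table names are illustrative:

    SELECT table_name,
           ROUND(data_length  / 1024 / 1024) AS data_mb,
           ROUND(index_length / 1024 / 1024) AS index_mb
      FROM information_schema.tables
     WHERE table_schema = 'mydb'     -- illustrative schema name
       AND table_name = 'stock';     -- illustrative table name

    SHOW VARIABLES LIKE 'innodb_buffer_pool_size';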

So... after discussing the pain points of each of your requirements, I have the following mental image of a system:

a) two sets of master-slaves. The master of the passive set will be a
'slave' to the master of the active set.
b) each set stores its data in InnoDB
c) the stock table partitioned based on (serial#, supplier) -- I chose
that order because I think it will give a better random spread among the
partitions and because I think it will be much more common to ask 'which
suppliers have part XXX' than 'what are all the parts that supplier YYY
has' (see the sketch after this list).
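
To make (c) concrete, here is a sketch of what such a table might look like. The column names, types, and partition count are illustrative guesses, not a tested design. Note that MySQL requires every unique key on a partitioned table to include all of the partitioning columns, which the primary key below satisfies:

    CREATE TABLE stock (
        supplier_id INT UNSIGNED NOT NULL,
        serial_no   VARCHAR(64)  NOT NULL,
        sold        TINYINT(1)   NOT NULL DEFAULT 0, -- 'transacted' flag
        PRIMARY KEY (supplier_id, serial_no),
        KEY idx_serial (serial_no)  -- serves 'which suppliers have part XXX'
    ) ENGINE = InnoDB
      PARTITION BY KEY (serial_no, supplier_id)
      PARTITIONS 16;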

As always, take this advice with a grain of salt and adjust this possible design for any factors you did not include in your list of requirements. It may even be (depending on the size of your rows and other factors) that MySQL Cluster is a better fit for your requirements. I encourage you to engage Cluster sales or a reputable consultant for an evaluation and their recommendation, too (disclaimer: I am not a Cluster guru). I also encourage you to seek multiple recommendations; many different people have built many different solutions to the problems you describe, and what works in my mind may not work in all situations.

Regards,
--
Shawn Green
MySQL Principal Technical Support Engineer
Oracle USA, Inc. - Hardware and Software, Engineered to Work Together.
Office: Blountville, TN
