Hello Machiel,

On 12/6/2011 01:40, Machiel Richards - Gmail wrote:
Good day all

Someone has asked me the following. Since I do not have many years of
experience with volumes of this type, I am posting it here, as I know
someone will probably be able to answer it better than I can.

(This will also give me a learning opportunity to see what to do)

Client Question:

Well, let me describe the issue.

1. I need to load records into a MySQL database table -- no problem so
far ;-)

2. The table represents "stock" that will be searched and transacted
(i.e. sold, which involves changing flags on the record) by a live
system.

3. The stock table will be big -- millions or tens of millions of rows.

4. Stock is uniquely identified by two fields -- a supplier ID (numeric)
and a serial number (varchar).

5. Transaction volumes may also be very high.

6. Stock must be available to the system 24/7.

7. I will need to replenish the stock table from a file, one or more
times a day -- potentially loading tens or hundreds of thousands of rows
each time.

8. The DB will be master-slave: reporting and recon files off the
slave, transactions off the master (and presumably replenishment into
the master).

I can go into a lot more detail about the process I am following (using
an ETL tool called Talend) ... but the essential question is about
strategies for doing this kind of dynamic loading:

1. How to insert data (high volumes) into the live table without locking
it and affecting transaction performance (INSERT LOW_PRIORITY?)

2. How to speed up inserts, even with a two-field unique key constraint.
My observation is the obvious one -- that inserts get slower and slower
as the table grows (date-based partitions of some kind, maybe?).

3. General principles/strategies for dealing with situations like this.



Can someone please assist?


I can't give you precise details but I can point you in the right directions. Your requirements are well-formed but they tend to contradict each other. While there is no way to remove the contradictions completely, there are ways to minimize their impact.

#5 High transaction volumes
#6 Available 24x7
#1,#7 Bulk updates of 10000+ records daily

These three are in conflict. Database changes require index maintenance, which can be fast (for small changes or small indexes) or take a noticeable length of time (larger changes, larger indexes, or both). This means you may need two systems that you flip-flop into place to minimize your downtime; graphics card manufacturers solved the same problem by creating multiple frame buffers. You can make your 'unavailability' window as short as possible by updating a passive copy of the data while your application front-end is not pointed at it, then swapping the 'updated' set of data for the 'old' set, either by altering the virtual IPs of your sets of instances or by redirecting which set your applications pull data from.
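The same flip-flop can also be done inside a single instance at the table level, because RENAME TABLE swaps several tables in one atomic step. A minimal sketch, assuming a staging table and an illustrative file path (both hypothetical), and glossing over any flag changes that land on 'stock' while the copy is being built -- those would need to be paused or replayed:

    -- Build the 'passive' copy offline
    CREATE TABLE stock_new LIKE stock;
    INSERT INTO stock_new SELECT * FROM stock;

    -- Apply the replenishment file to the passive copy
    LOAD DATA INFILE '/data/replenish.csv'
        INTO TABLE stock_new
        FIELDS TERMINATED BY ',' LINES TERMINATED BY '\n';

    -- Swap atomically: readers see either the old table or the new
    -- one, never an in-between state
    RENAME TABLE stock TO stock_old, stock_new TO stock;
    DROP TABLE stock_old;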

#8 System will be master-slave
My flip-flop idea implies that your system will have two sets of master-slave(s): one carrying the 'current' data and one used to build the 'new' set of data (with the imports). This also implies that your 'active' set will need to replicate to your 'passive' set to keep it in sync between bulk updates.
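Keeping the passive set in sync is ordinary MySQL replication. A minimal sketch run on the passive set's master -- the host name, credentials, and binlog coordinates below are placeholders, not real values:

    -- Point the passive master at the active master
    CHANGE MASTER TO
        MASTER_HOST = 'active-master.example.com',
        MASTER_USER = 'repl',
        MASTER_PASSWORD = '********',
        MASTER_LOG_FILE = 'mysql-bin.000001',
        MASTER_LOG_POS = 4;
    START SLAVE;

    -- Pause replication while the bulk import rebuilds this set
    STOP SLAVE;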

#2a Many records need to change in a day
#3 There will be millions of records
#2b Searches need to be fast

These conflict with each other, too. The more records you add to a table, the longer any indexed lookup will take. If the index data is not already in memory, a trip to disk is necessary to retrieve the columns for your query. Combine that with the number of concurrent queries, divide by the maximum number of random-access reads a physical disk can sustain, and you may easily exceed the capacity of any single disk storage system. This implies that you need to look at dividing your storage among several independent devices. Options abound: sharding, partitioning, or simple configuration changes (some tablespaces on one device, some on others). Or you can price solid-state disks for your storage needs. Factoring in requirement #4, a partitioning scheme based on (supplier, serial#) may be a good first design choice.
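A rough way to gauge how close you are to that point is to compare the table's data and index footprint against the InnoDB buffer pool; once the hot index pages no longer fit, lookups start paying for disk seeks. A sketch -- the schema and table names are illustrative:

    SELECT table_name,
           ROUND(data_length  / 1024 / 1024) AS data_mb,
           ROUND(index_length / 1024 / 1024) AS index_mb
      FROM information_schema.tables
     WHERE table_schema = 'mydb'     -- illustrative schema name
       AND table_name = 'stock';     -- illustrative table name

    SHOW VARIABLES LIKE 'innodb_buffer_pool_size';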

So... after discussing the pain points of each of your requirements, I have the following mental image of a system:

a) two sets of master-slaves. The master of the passive set will be a
'slave' to the master of the active set.
b) each set stores its data in InnoDB
c) the stock table partitioned based on (serial#, supplier) -- I chose
that order because I think it will give a better random spread among the
partitions and because I think it will be much more common to ask 'which
suppliers have part XXX' than 'what are all the parts that supplier YYY
has' (see the sketch after this list).
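
To make (c) concrete, here is a sketch of what such a table might look like. The column names, types, and partition count are illustrative guesses, not a tested design. Note that MySQL requires every unique key on a partitioned table to include all of the partitioning columns, which the primary key below satisfies:

    CREATE TABLE stock (
        supplier_id INT UNSIGNED NOT NULL,
        serial_no   VARCHAR(64)  NOT NULL,
        sold        TINYINT(1)   NOT NULL DEFAULT 0, -- 'transacted' flag
        PRIMARY KEY (supplier_id, serial_no),
        KEY idx_serial (serial_no)  -- serves 'which suppliers have part XXX'
    ) ENGINE = InnoDB
      PARTITION BY KEY (serial_no, supplier_id)
      PARTITIONS 16;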

As always, take this advice with a grain of salt and adjust this possible design for any factors you did not include in your list of requirements. It may even be (depending on the size of your rows and other factors) that MySQL Cluster is a better fit for your requirements. I encourage you to engage Cluster sales or a reputable consultant for an evaluation and their recommendation, too (disclaimer: I am not a Cluster guru). I also encourage you to seek multiple recommendations; many different people have built many different solutions to the problems you describe, and what works in my mind may not work in all situations.

Regards,
--
Shawn Green
MySQL Principal Technical Support Engineer
Oracle USA, Inc. - Hardware and Software, Engineered to Work Together.
Office: Blountville, TN
