Re: splitting large tables vertically
kimky...@fhda.edu (Kyong Kim) writes:

> I was wondering about a scale-out problem. Let's say you have a large
> table with 3 cols and 500+ million rows. Would there be much benefit in
> splitting the columns into different tables based on INT type primary
> keys across the tables?

To answer your question properly requires more information:

1. The expected table structure. Can you show the current SHOW CREATE TABLE xxx\G output?
2. The expected use cases for extracting data.
3. Do you expect to delete data frequently, are you only inserting data, or is there a mix of inserts and deletes? If so, provide more info.

I've come across situations where a large table like this caused lots of problems. There were lots of concurrent delete batches (cleaning up) and at the same time lots of inserts, while large groups of selects were collecting certain sets of data for presentation. Perhaps you are doing something similar?

If so, you may find it extremely important to get the keys right, especially the primary keys, so that data retrieval (for SELECTs or DELETEs) is as fast as possible (using the clustered index [the PRIMARY KEY in InnoDB]). If not, or if the queries overlap, you may find performance degradation a big issue as InnoDB manages the locks needed to ensure that concurrent statements don't interfere.

You can also use MERGE tables sitting on top of MyISAM tables, with one year's (or whatever period's) data in each underlying table. That way, to find data for 2009 you look only in table xxx_2009, which can be a big win. The inconvenience of MyISAM is that if the server ever crashes, recovery of these tables can be very time-consuming; InnoDB, on the other hand, has a larger footprint for the same data.

So without more information on the structure and the use cases it's hard to answer your question. If you have the time, try out and benchmark the different approaches and see which is best for your requirements.
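For what it's worth, the MERGE-over-MyISAM layout described above could look something like this (a hypothetical sketch only; the table and column names are invented, and this is untested against your schema):

```sql
-- One MyISAM table per year, unioned under a single MERGE table.
CREATE TABLE events_2008 (
  id   INT NOT NULL,
  ts   DATETIME NOT NULL,
  data VARCHAR(255),
  KEY (id)
) ENGINE=MyISAM;

CREATE TABLE events_2009 LIKE events_2008;

-- The MERGE table must match the underlying column and key definitions.
CREATE TABLE events (
  id   INT NOT NULL,
  ts   DATETIME NOT NULL,
  data VARCHAR(255),
  KEY (id)
) ENGINE=MERGE UNION=(events_2008, events_2009) INSERT_METHOD=LAST;

-- Queries that only need one year's data can skip the MERGE table
-- entirely and hit the underlying table directly:
SELECT * FROM events_2009 WHERE id = 42;
```

The win comes from queries that can be routed to a single, much smaller underlying table; queries against the MERGE table itself still touch every member table.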
Just remember that as the data grows, the initial measurements may not be consistent with the behaviour you see later. Also, if you are looking at a large amount of data like this, appropriate server tuning can influence performance significantly.

Hope this helps.

Simon

--
MySQL General Mailing List
For list archives: http://lists.mysql.com/mysql
To unsubscribe: http://lists.mysql.com/mysql?unsub=arch...@jab.org
Re: splitting large tables vertically
Simon,

Thanks for the feedback. I don't have all the details of the schema and workload; it's just an interesting idea that was presented to me. I think the idea is to split a lengthy secondary-key lookup into 2 primary-key lookups, and to reduce the cost of clustering the secondary key with the primary-key data by using a shorter INT type surrogate key. Another downside is the possible need for foreign keys and the added complexity of insertions and multi-column updates.

Have you found primary-key lookups to be at least twice as fast as secondary-key lookups with a VARCHAR type primary key in InnoDB? The whole idea is based on the assumption that they are.

Also, MyISAM conversion is an option too. Have you found the table maintenance to be a significant overhead? I've experienced MyISAM table corruption in production and I'm more inclined to go with InnoDB for its reliability. This is a fairly important table.

Any insight would be much appreciated.

Kyong
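The split being described might look roughly like this (a hypothetical sketch only; all table and column names here are invented, since the real schema wasn't posted):

```sql
-- Before: the wide row is clustered on a long VARCHAR primary key,
-- so every secondary-index entry also carries that long key.
CREATE TABLE account (
  uuid VARCHAR(64) NOT NULL,
  col2 VARCHAR(255),
  col3 VARCHAR(255),
  PRIMARY KEY (uuid)
) ENGINE=InnoDB;

-- After: a narrow mapping table resolves the VARCHAR natural key to a
-- short INT surrogate, and the bulk of the data is clustered on the INT.
CREATE TABLE account_key (
  id   INT UNSIGNED NOT NULL AUTO_INCREMENT,
  uuid VARCHAR(64) NOT NULL,
  PRIMARY KEY (id),
  UNIQUE KEY (uuid)
) ENGINE=InnoDB;

CREATE TABLE account_data (
  id   INT UNSIGNED NOT NULL,
  col2 VARCHAR(255),
  col3 VARCHAR(255),
  PRIMARY KEY (id)
) ENGINE=InnoDB;

-- A lookup by uuid then becomes a probe of the small key table
-- followed by a primary-key probe of the data table:
SELECT d.col2, d.col3
FROM account_key k
JOIN account_data d ON d.id = k.id
WHERE k.uuid = 'some-uuid';
```

Whether the shorter clustered key pays for the extra join is exactly the question being debated in this thread, so this should be read as the shape of the idea, not a recommendation.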
Inst. Web Programmer
CMDBA 5.0
Re: splitting large tables vertically
kimky...@fhda.edu (Kyong Kim) writes:

> I don't have all the details of the schema and workload. Just an
> interesting idea that was presented to me. I think the idea is to
> split a lengthy secondary key lookup into 2 primary key lookups and
> reduce the cost of clustering secondary key with primary key data by
> using a shorter INT type surrogate key. Another downside is the
> possible need of foreign keys and added complexity of insertions and
> multi-column updates. Have you found primary key lookups to be at
> least twice as fast as secondary key lookups with VARCHAR type primary
> key in InnoDB? The whole idea is based on the assumption that it is.

That's why you really need to be more precise about the data structures you are planning to use: they can change the results significantly. So no, I don't have any specific answers to your questions, as you don't provide any specific information in what you ask.

> Also, MyISAM conversion is an option too. Have you found the table
> maintenance to be a significant overhead? I've experienced MyISAM
> table corruptions in production and I'm more inclined to go with
> InnoDB for its reliability. This is a fairly important table.

Disk (and memory) usage can also be important, so since InnoDB storage is less efficient, the conversion may actually degrade performance. Until you are more concrete it's hard to say what will work best for you.

Simon
Re: splitting large tables vertically
> That's why you really need to be more precise in the data structures
> you are planning on using. This can change the results significantly.
> So no, I don't have any specific answers to your questions as you
> don't provide any specific information in what you ask.

Yeah. Let me see if I can follow up with more concrete information sometime in the future. I find performance tuning to be workload-dependent, and it is difficult to project without having all the details.

> Well disk (and memory) usage can also be important so as it seems
> InnoDB storage is less efficient this may actually degrade
> performance. Until you are more concrete it's hard to say what will
> work best for you.

At this point I'm fairly convinced that this idea of vertically partitioning a table into column tables will degrade performance unless the workload is tailor-made for it. The cost of the joins and of the index-lookup/column-data split seems a bit too high for almost any scenario.

Thanks for the prompt response. I'll follow up with you if I have more concrete details.

Thanks,
Kyong
Inst. Web Programmer
CMDBA 5.0
Re: splitting large tables vertically
Do the 3 tables have different column structures, or do they all have the same structure? For example, is table1 storing only data for year 1990 and table2 storing data for 1991, etc.? If so you could use a MERGE table. (Or do you need transactions, in which case you will need to use InnoDB, and MERGE tables are not possible.)

Is there always a 1:1 relationship between the tables? If so, I'd recommend storing all the data in 1 table. Joins are really expensive and are to be avoided if at all possible.

One way I found to make joins faster when retrieving under 5k rows is to select the rows from each table into temporary Memory tables, build the primary index on each Memory table, and then use the Memory tables for the join. This seems like more work, but it is about twice as fast as joining the physical tables together.

Hope this helps. If you could provide us with more information on how you will be adding and retrieving the data (how many rows are being retrieved in a query?), perhaps we can narrow down the solution for you.

Mike

At 07:44 AM 5/9/2009, Kyong Kim wrote:

> I was wondering about a scale-out problem. Let's say you have a large
> table with 3 cols and 500+ million rows. Would there be much benefit
> in splitting the columns into different tables based on INT type
> primary keys across the tables? The split tables will be hosted on the
> same physical instance but can be spread over multiple disks. We're
> also open to splitting the query and reconstituting the data at the
> application layer, such as
>
>   select col1, col2 from t1 where col2='name';
>   select col2 from t2 where col1=t1.col1;
>   select col2 from t3 where col1=t1.col1;
>
> as opposed to
>
>   select t1.col2, t2.col2, t3.col2
>   from t1
>   inner join t2 on t1.col1=t2.col1
>   inner join t3 on t1.col1=t3.col1;
>
> My concern with this approach is the overhead of joins over such a
> large number of rows. I was doing some research into the cost of joins
> and, as of 5.0, joins were still nested-loop scans.
> I was wondering if there are others with practical experience in this
> matter and what they've found. Any feedback will be much appreciated.
>
> Kyong
> Inst. Web Programmer
> CMDBA 5.0
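Mike's Memory-table trick, applied to the t1/t2 example from the original question, could be sketched roughly like this (hypothetical and untested against a live server; only t1, t2, col1, and col2 come from the thread, the rest is invented):

```sql
-- Copy just the needed rows from each big table into small MEMORY
-- temporary tables.
CREATE TEMPORARY TABLE tmp1 ENGINE=MEMORY
  SELECT col1, col2 FROM t1 WHERE col2 = 'name';

CREATE TEMPORARY TABLE tmp2 ENGINE=MEMORY
  SELECT col1, col2 FROM t2
  WHERE col1 IN (SELECT col1 FROM tmp1);

-- Build the primary index on each Memory table, as Mike describes.
ALTER TABLE tmp1 ADD PRIMARY KEY (col1);
ALTER TABLE tmp2 ADD PRIMARY KEY (col1);

-- Join the small in-memory tables instead of the 500M-row base tables.
SELECT tmp1.col2, tmp2.col2
FROM tmp1
JOIN tmp2 ON tmp1.col1 = tmp2.col1;

DROP TEMPORARY TABLE tmp1, tmp2;
```

This only pays off when the filtered row sets are small (Mike suggests under 5k rows), since the copies and index builds are pure overhead otherwise.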