Re: splitting large tables vertically

2009-05-10 Thread Simon J Mudd
kimky...@fhda.edu (Kyong Kim) writes:

 I was wondering about a scale out problem.
 Let's say you have a large table with 3 cols and 500+ million rows.
 
 Would there be much benefit in splitting the columns into different tables
 based on INT type primary keys across the tables?

To answer your question properly requires more information:

1. Expected table structure. Can you show the current SHOW CREATE TABLE xxx\G
   output? (A hypothetical example of such a table is sketched below.)
2. What are the expected use cases for extracting the data?
3. Do you expect to delete data frequently, are you only inserting data,
   or is there a mix of inserts and deletes? If there's a mix, please give more detail.
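
To make the first point concrete, here is a purely hypothetical shape
for such a table (3 columns, invented names and types); the real
structure may differ in ways that change the advice:

  CREATE TABLE t (
    col1 INT NOT NULL,            -- surrogate key?
    col2 VARCHAR(255) NOT NULL,   -- looked up by value?
    col3 VARCHAR(255),            -- payload?
    PRIMARY KEY (col1),
    KEY (col2)
  ) ENGINE=InnoDB;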

I've come across situations where a large table like this caused lots
of problems. There were lots of concurrent delete batches (cleaning
up) and at the same time lots of inserts, plus large groups of selects
to collect certain sets of data for presentation. Perhaps you are
doing something similar? If so, you may find that it's extremely
important to get the keys right, especially the primary key, so that
data retrieval (for SELECTs or DELETEs) is as fast as possible (using
the clustered index [the PRIMARY KEY in InnoDB]). If the keys are
wrong, or if the queries overlap, you may find performance degradation
a big issue as InnoDB manages the locks needed to keep the concurrent
statements from interfering.
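
As a purely illustrative sketch (invented names, not your schema): if
rows are purged in batches by date, leading the clustered primary key
with the date keeps each purge and each per-day read on contiguous
pages:

  CREATE TABLE events (
    batch_date DATE NOT NULL,
    id         INT  NOT NULL,
    payload    VARCHAR(255),
    PRIMARY KEY (batch_date, id)
  ) ENGINE=InnoDB;

  -- both of these become range scans on the clustered index:
  DELETE FROM events WHERE batch_date < '2009-01-01' LIMIT 10000;
  SELECT payload FROM events WHERE batch_date = '2009-05-01';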

You can also use merge tables sitting on top of MyISAM tables holding
one year (or whatever partition suits you) of data each. That saves
you having to search everything to find the data for 2009: you just
look in table xxx_2009, so this can be a big win (a sketch follows).
MyISAM has the inconvenience that if the server ever crashes, recovery
of these tables can be very time-consuming. InnoDB, in turn, has a
larger on-disk footprint for the same data.
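
A minimal sketch of that layout, with invented table and column names
(the MERGE engine only works over identically-defined MyISAM tables):

  CREATE TABLE xxx_2008 (
    id  INT NOT NULL,
    val VARCHAR(64),
    KEY (id)
  ) ENGINE=MyISAM;

  CREATE TABLE xxx_2009 LIKE xxx_2008;

  CREATE TABLE xxx_all (
    id  INT NOT NULL,
    val VARCHAR(64),
    KEY (id)
  ) ENGINE=MERGE UNION=(xxx_2008, xxx_2009) INSERT_METHOD=LAST;

  -- query one year directly, or the whole set through the merge table:
  SELECT val FROM xxx_2009 WHERE id = 42;
  SELECT val FROM xxx_all  WHERE id = 42;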

So without more information on the structure and the use cases it's
hard to answer your question. In fact, if you have the time, try out
different approaches, benchmark them, and see which is best for your
requirements. Just remember that as the data grows, the initial
measurements may not be consistent with the behaviour you see later.
Also, with an amount of data this large, appropriate server tuning can
influence performance significantly.

Hope this helps.

Simon




Re: splitting large tables vertically

2009-05-10 Thread Kyong Kim
Simon,
Thanks for the feedback.
I don't have all the details of the schema and workload. Just an
interesting idea that was presented to me.
I think the idea is to split one lookup on a lengthy secondary key
into two primary key lookups, and to reduce the cost of storing the
primary key inside every secondary index entry by using a shorter INT
surrogate key (roughly the pattern sketched below). Another downside
is the possible need for foreign keys and the added complexity of
inserts and multi-column updates.
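
If I've understood the proposal, it amounts to something like the
following sketch (all names invented; the real schema is unknown):

  -- instead of one table with a long VARCHAR key ...
  --   SELECT payload FROM t WHERE name = 'some-long-value';
  -- ... split on a short INT surrogate key:
  CREATE TABLE t_name (
    id   INT NOT NULL AUTO_INCREMENT,
    name VARCHAR(255) NOT NULL,
    PRIMARY KEY (id),
    UNIQUE KEY (name)      -- secondary entries now carry a 4-byte PK
  ) ENGINE=InnoDB;

  CREATE TABLE t_payload (
    id      INT NOT NULL,
    payload VARCHAR(255),
    PRIMARY KEY (id)
  ) ENGINE=InnoDB;

  -- lookup 1: name -> id; lookup 2: id -> payload
  SELECT id FROM t_name WHERE name = 'some-long-value';
  SELECT payload FROM t_payload WHERE id = 42;  -- 42 = id from above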

Have you found primary key lookups to be at least twice as fast as
secondary key lookups when the primary key is a VARCHAR in InnoDB? The
whole idea rests on the assumption that they are.

MyISAM conversion is also an option. Have you found the table
maintenance to be a significant overhead? I've experienced MyISAM
table corruption in production and I'm more inclined to go with InnoDB
for its reliability. This is a fairly important table.

Any insight would be much appreciated.
Kyong

Inst. Web Programmer
CMDBA 5.0




Re: splitting large tables vertically

2009-05-10 Thread Simon J Mudd
kimky...@fhda.edu (Kyong Kim) writes:

 I don't have all the details of the schema and workload. Just an
 interesting idea that was presented to me.
 I think the idea is to split one lookup on a lengthy secondary key
 into two primary key lookups, and to reduce the cost of storing the
 primary key inside every secondary index entry by using a shorter INT
 surrogate key. Another downside is the possible need for foreign keys
 and the added complexity of inserts and multi-column updates.
 
 Have you found primary key lookups to be at least twice as fast as
 secondary key lookups when the primary key is a VARCHAR in InnoDB? The
 whole idea rests on the assumption that they are.

That's why you really need to be more precise about the data
structures you are planning to use: the details can change the results
significantly.

So no, I don't have specific answers to your questions, because you
don't provide any specific information to answer from.

 Also, MyISAM conversion is an option too. Have you found the table
 maintenance to be a significant overhead? I've experienced MyISAM table
 corruptions in production and I'm more inclined to go with InnoDB for its
 reliability. This is a fairly important table.

Well, disk (and memory) usage can also be important, and since InnoDB
storage is less compact, converting to it may actually degrade
performance. Until you are more concrete it's hard to say what will
work best for you.

Simon





Re: splitting large tables vertically

2009-05-10 Thread Kyong Kim
 That's why you really need to be more precise about the data
 structures you are planning to use: the details can change the results
 significantly.

 So no, I don't have specific answers to your questions, because you
 don't provide any specific information to answer from.

Yeah. Let me see if I can follow up with more concrete information
sometime in the future. I find performance tuning to be
workload-dependent, and it is difficult to project without having all
the details.

 Well, disk (and memory) usage can also be important, and since InnoDB
 storage is less compact, converting to it may actually degrade
 performance. Until you are more concrete it's hard to say what will
 work best for you.

At this point I'm fairly convinced that this idea of vertically
partitioning a table into per-column tables will degrade performance
unless the workload is tailor-made for it. The cost of the joins and
of the extra index lookups to reassemble the column data seems too
high for almost any scenario.
Thanks for the prompt response. I'll follow up with you if I have more
concrete details.

Thanks
Kyong







Inst. Web Programmer
CMDBA 5.0




Re: splitting large tables vertically

2009-05-09 Thread mos
Do the 3 tables have different column structures, or do they all have
the same structure? For example, is table1 storing only data for year
1990 and table2 storing data for 1991, etc.? If so, you could use a
merge table. (Or do you need transactions? In that case you will need
to use InnoDB, and merge tables are not possible.)


Is there always a 1:1 relationship between the tables? If so, I'd
recommend storing all the data in 1 table. Joins are really expensive
and are to be avoided if at all possible.


One way I found to make joins faster when retrieving under 5k rows is
to select the rows from each table into temporary MEMORY tables, build
the primary index on each memory table, and then use the memory tables
for the join (see the sketch below). This seems like more work, but it
is about twice as fast as joining the physical tables together.
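
Roughly, reusing the t1/t2 naming from the original question (the
filter and key columns are placeholders):

  CREATE TEMPORARY TABLE m1 ENGINE=MEMORY
    SELECT col1, col2 FROM t1 WHERE col2 = 'name';
  ALTER TABLE m1 ADD PRIMARY KEY (col1);

  CREATE TEMPORARY TABLE m2 ENGINE=MEMORY
    SELECT col1, col2 FROM t2 WHERE col1 IN (SELECT col1 FROM m1);
  ALTER TABLE m2 ADD PRIMARY KEY (col1);

  -- join the small in-memory copies instead of the physical tables:
  SELECT m1.col2, m2.col2
  FROM m1 INNER JOIN m2 ON m1.col1 = m2.col1;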


Hope this helps. If you could provide us with more information about
how you will be adding and retrieving the data (how many rows are
being retrieved in a query?), perhaps we can narrow down the solution
for you.



Mike

At 07:44 AM 5/9/2009, Kyong Kim wrote:

I was wondering about a scale out problem.
Let's say you have a large table with 3 cols and 500+ million rows.

Would there be much benefit in splitting the columns into different tables
based on INT type primary keys across the tables? The split tables will be
hosted on the same physical instance but can be spread over multiple disks.
We're also open to splitting the query and reconstituting the data at the
application layer, such as:

select col1, col2 from t1 where col2='name';
-- the next two use the col1 values returned by the first query,
-- substituted in by the application:
select col2 from t2 where col1=?;
select col2 from t3 where col1=?;

as opposed to

select t1.col2, t2.col2, t3.col2 from t1 inner join t2 on t1.col1=t2.col1
inner join t3 on t1.col1=t3.col1;

My concern with this approach is the overhead of joining such a large
number of rows. I did some research into the cost of joins, and as of
5.0 they were still executed as nested-loop scans.

I was wondering if there are others with practical experience in this
matter and what they've found. Any feedback will be much appreciated.

Kyong
Inst. Web Programmer
CMDBA 5.0



