Huge number of tables with InnoDB

2006-01-13 Thread John McCaskey
Hi everyone,

 

I'm running MySQL 4.0.18 on Debian with a 2.6 linux kernel using ext3 as
the underlying filesystem for the database storage.

 

I currently have some InnoDB tables with the following structure:

 

Log_20060101 {

Monitor_id medium int,

Timestamp timestamp,

Avg float,

PRIMARY KEY Monitor_id, Timestamp

}

 

We partition these tables by date as you can see as they grow very large
and they get to be slow to insert and query to over time.  We have the
idea to change the partitioning so the tables are as follows:

 

Log_[monitor_id] {

Timestamp,

Avg float,

PRIMARY KEY Timestamp

}

 

This seems to have several key advantages:

 

1)   Reduced disk space usage 

2)   Easier querying of data across time (but not across individual
id's, it turns out that doesn't ever really happen in our data usage
anyway though)

3)   Smaller tables, resulting in faster reads/writes, also smaller
data volume hopefully also resulting in faster reads/writes due to less
disk io neccesary

 

However, in our actual testing the 'faster writes' expectation is
getting shot down.  With 20,000 unique monitor id's and 8928 unique
timestamps inserting to the old set of tables (20051201-20051231 in this
case) is taking me about 4 hours 20 minutes.  Inserting to the 20,000
new tables (Log_0-Log_2) is taking about 10 hours.  I expected this
to be much faster as I hoped finding the right table for an insert would
be a Hash type lookup taking linear time, while inserting into the large
tree structure in the old tables which have a much higher volume of rows
would be logarithmic time.  Where did I go wrong?  My only real thought
so far is the disk subsystem of the OS being slow with large numbers of
files, but I thought it wouldn't matter for InnoDB as the data storage
is all one file.  Does anyone know what would cause the inserts to be so
much slower?  

 

John A. McCaskey

Software Development Engineer

Klir Technologies, Inc.

[EMAIL PROTECTED]

206.902.2027

 



Re: Huge number of tables with InnoDB

2006-01-13 Thread Alec . Cawley
To reply to this, I think we have to understand why you have chosen to 
split the tables at all. It seems to me that this, by introducing a 
two-level lookup, is certain to be slower than any possible single table 
lookup. Generally, Log A + log B is bound to be larger than log (A*B). 

It appears that you are querying prediminantly by time. In this case, your 
index *must* start with the timestamp, not the monitor ID. I would suggest 
that you need an index on timestamp, and possible one on monitor ID - but 
not if, as you say, you never query by monitor ID at all. Do you need a 
PRIMARY KEY at all? In what way will your system break if there happen to 
be two entries with the same timestamp and monitor ID? Presumably this 
will reflect two events very close together: Wouldn't you rather store 
that fact rather than lose it? 

Generally, I would query your decision to have multiple tables by date to 
whatever. In my experience, whenever I have introduced such concepts into 
my early designs, they have disappeareed later into a better design. It 
looks to mee as if you are using a tool optimied to do fast searches on 
large databases, then crippling its ability to optimise.

I would expect the use of thousands of tables effectively to disable 
MySQL's caching capability, which is one of the biggest performance 
boosters.

Alec






John McCaskey [EMAIL PROTECTED] 
13/01/2006 17:20

To
MySQL mysql@lists.mysql.com
cc

Subject
Huge number of tables with InnoDB






Hi everyone,

 

I'm running MySQL 4.0.18 on Debian with a 2.6 linux kernel using ext3 as
the underlying filesystem for the database storage.

 

I currently have some InnoDB tables with the following structure:

 

Log_20060101 {

Monitor_id medium int,

Timestamp timestamp,

Avg float,

PRIMARY KEY Monitor_id, Timestamp

}

 

We partition these tables by date as you can see as they grow very large
and they get to be slow to insert and query to over time.  We have the
idea to change the partitioning so the tables are as follows:

 

Log_[monitor_id] {

Timestamp,

Avg float,

PRIMARY KEY Timestamp

}

 

This seems to have several key advantages:

 

1)   Reduced disk space usage 

2)   Easier querying of data across time (but not across individual
id's, it turns out that doesn't ever really happen in our data usage
anyway though)

3)   Smaller tables, resulting in faster reads/writes, also smaller
data volume hopefully also resulting in faster reads/writes due to less
disk io neccesary

 

However, in our actual testing the 'faster writes' expectation is
getting shot down.  With 20,000 unique monitor id's and 8928 unique
timestamps inserting to the old set of tables (20051201-20051231 in this
case) is taking me about 4 hours 20 minutes.  Inserting to the 20,000
new tables (Log_0-Log_2) is taking about 10 hours.  I expected this
to be much faster as I hoped finding the right table for an insert would
be a Hash type lookup taking linear time, while inserting into the large
tree structure in the old tables which have a much higher volume of rows
would be logarithmic time.  Where did I go wrong?  My only real thought
so far is the disk subsystem of the OS being slow with large numbers of
files, but I thought it wouldn't matter for InnoDB as the data storage
is all one file.  Does anyone know what would cause the inserts to be so
much slower? 

 

John A. McCaskey

Software Development Engineer

Klir Technologies, Inc.

[EMAIL PROTECTED]

206.902.2027

 




-- 
MySQL General Mailing List
For list archives: http://lists.mysql.com/mysql
To unsubscribe:http://lists.mysql.com/[EMAIL PROTECTED]



RE: Huge number of tables with InnoDB

2006-01-13 Thread John McCaskey
Alec,

Thanks for your response; however, I'm not convinced you are correct :)
Let me try to explain in more detail my reasoning below.

Why should the two level lookup be Log A + Log B?  Looking up a
tablename is always an equilvalency check, not a range check.  So there
is no advantage to having a tree style structure for the lookup.  The
structure I would expect to be used is a hash table which would then
make the two level lookup 1 + Log B, rather than Log(A*B) initially.

As to index's, we ALWAYS include the monitor id in the query and query
one monitor id at a time.  We generally also include a time range.  The
reason I have structured the index monitor id first then timestamp on
the old tables is that getting down to just one monitor id eliminates
more rows then getting down to one timestamp.  Second, you cannot have
two separate index's and have both used in a single query when using
InnoDB.  MySQL will choose the best index and use it.  Using this index
as the primary key instead of a separate index improves disk space usage
with InnoDB as the primary key doesn't require a separate index
structure but will order the rows correctly in the actual storage. 

As to events 'close to one another' we have a set interval of 5 minutes
for the timestamps, so that's not a concern, we do want the monitor_id,
timestamp pairs to be unique.

We insert many millions of these rows per day in our production systems,
we also require historical storage for up to one year.  Keeping that
volume of rows in one table causes both inserts and selects to come to a
crawl.  Unless I'm missing something here not partitioning the data is
not a remotely feasible option.

Comments?

John A. McCaskey
Software Development Engineer
Klir Technologies, Inc.
[EMAIL PROTECTED]
206.902.2027

-Original Message-
From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED] 
Sent: Friday, January 13, 2006 9:41 AM
To: John McCaskey
Cc: MySQL
Subject: Re: Huge number of tables with InnoDB

To reply to this, I think we have to understand why you have chosen to 
split the tables at all. It seems to me that this, by introducing a 
two-level lookup, is certain to be slower than any possible single table

lookup. Generally, Log A + log B is bound to be larger than log (A*B). 

It appears that you are querying prediminantly by time. In this case,
your 
index *must* start with the timestamp, not the monitor ID. I would
suggest 
that you need an index on timestamp, and possible one on monitor ID -
but 
not if, as you say, you never query by monitor ID at all. Do you need a 
PRIMARY KEY at all? In what way will your system break if there happen
to 
be two entries with the same timestamp and monitor ID? Presumably this 
will reflect two events very close together: Wouldn't you rather store 
that fact rather than lose it? 

Generally, I would query your decision to have multiple tables by date
to 
whatever. In my experience, whenever I have introduced such concepts
into 
my early designs, they have disappeareed later into a better design. It 
looks to mee as if you are using a tool optimied to do fast searches on 
large databases, then crippling its ability to optimise.

I would expect the use of thousands of tables effectively to disable 
MySQL's caching capability, which is one of the biggest performance 
boosters.

Alec






John McCaskey [EMAIL PROTECTED] 
13/01/2006 17:20

To
MySQL mysql@lists.mysql.com
cc

Subject
Huge number of tables with InnoDB






Hi everyone,

 

I'm running MySQL 4.0.18 on Debian with a 2.6 linux kernel using ext3 as
the underlying filesystem for the database storage.

 

I currently have some InnoDB tables with the following structure:

 

Log_20060101 {

Monitor_id medium int,

Timestamp timestamp,

Avg float,

PRIMARY KEY Monitor_id, Timestamp

}

 

We partition these tables by date as you can see as they grow very large
and they get to be slow to insert and query to over time.  We have the
idea to change the partitioning so the tables are as follows:

 

Log_[monitor_id] {

Timestamp,

Avg float,

PRIMARY KEY Timestamp

}

 

This seems to have several key advantages:

 

1)   Reduced disk space usage 

2)   Easier querying of data across time (but not across individual
id's, it turns out that doesn't ever really happen in our data usage
anyway though)

3)   Smaller tables, resulting in faster reads/writes, also smaller
data volume hopefully also resulting in faster reads/writes due to less
disk io neccesary

 

However, in our actual testing the 'faster writes' expectation is
getting shot down.  With 20,000 unique monitor id's and 8928 unique
timestamps inserting to the old set of tables (20051201-20051231 in this
case) is taking me about 4 hours 20 minutes.  Inserting to the 20,000
new tables (Log_0-Log_2) is taking about 10 hours.  I expected this
to be much faster as I hoped finding the right table for an insert would
be a Hash type

Re: Huge number of tables with InnoDB

2006-01-13 Thread Peter Brawley




Alec writes

...Generally, Log A + log B is bound to be larger than log
(A*B)... 

Errm, log A + log B exactly = log(A*B) 
:-) .

PB

-

[EMAIL PROTECTED] wrote:

  To reply to this, I think we have to understand why you have chosen to 
split the tables at all. It seems to me that this, by introducing a 
two-level lookup, is certain to be slower than any possible single table 
lookup. Generally, Log A + log B is bound to be larger than log (A*B). 

It appears that you are querying prediminantly by time. In this case, your 
index *must* start with the timestamp, not the monitor ID. I would suggest 
that you need an index on timestamp, and possible one on monitor ID - but 
not if, as you say, you never query by monitor ID at all. Do you need a 
PRIMARY KEY at all? In what way will your system break if there happen to 
be two entries with the same timestamp and monitor ID? Presumably this 
will reflect two events very close together: Wouldn't you rather store 
that fact rather than lose it? 

Generally, I would query your decision to have multiple tables by date to 
whatever. In my experience, whenever I have introduced such concepts into 
my early designs, they have disappeareed later into a better design. It 
looks to mee as if you are using a tool optimied to do fast searches on 
large databases, then crippling its ability to optimise.

I would expect the use of thousands of tables effectively to disable 
MySQL's caching capability, which is one of the biggest performance 
boosters.

Alec






"John McCaskey" [EMAIL PROTECTED] 
13/01/2006 17:20

To
"MySQL" mysql@lists.mysql.com
cc

Subject
Huge number of tables with InnoDB






Hi everyone,

 

I'm running MySQL 4.0.18 on Debian with a 2.6 linux kernel using ext3 as
the underlying filesystem for the database storage.

 

I currently have some InnoDB tables with the following structure:

 

Log_20060101 {

Monitor_id medium int,

Timestamp timestamp,

Avg float,

PRIMARY KEY Monitor_id, Timestamp

}

 

We partition these tables by date as you can see as they grow very large
and they get to be slow to insert and query to over time.  We have the
idea to change the partitioning so the tables are as follows:

 

Log_[monitor_id] {

Timestamp,

Avg float,

PRIMARY KEY Timestamp

}

 

This seems to have several key advantages:

 

1)   Reduced disk space usage 

2)   Easier querying of data across time (but not across individual
id's, it turns out that doesn't ever really happen in our data usage
anyway though)

3)   Smaller tables, resulting in faster reads/writes, also smaller
data volume hopefully also resulting in faster reads/writes due to less
disk io neccesary

 

However, in our actual testing the 'faster writes' expectation is
getting shot down.  With 20,000 unique monitor id's and 8928 unique
timestamps inserting to the old set of tables (20051201-20051231 in this
case) is taking me about 4 hours 20 minutes.  Inserting to the 20,000
new tables (Log_0-Log_2) is taking about 10 hours.  I expected this
to be much faster as I hoped finding the right table for an insert would
be a Hash type lookup taking linear time, while inserting into the large
tree structure in the old tables which have a much higher volume of rows
would be logarithmic time.  Where did I go wrong?  My only real thought
so far is the disk subsystem of the OS being slow with large numbers of
files, but I thought it wouldn't matter for InnoDB as the data storage
is all one file.  Does anyone know what would cause the inserts to be so
much slower? 

 

John A. McCaskey

Software Development Engineer

Klir Technologies, Inc.

[EMAIL PROTECTED]

206.902.2027

 




  



No virus found in this outgoing message.
Checked by AVG Free Edition.
Version: 7.1.371 / Virus Database: 267.14.17/228 - Release Date: 1/12/2006


-- 
MySQL General Mailing List
For list archives: http://lists.mysql.com/mysql
To unsubscribe:http://lists.mysql.com/[EMAIL PROTECTED]