Hello,
I am seeking information on best practices for data warehousing with MySQL. I am currently using MySQL as the database of choice, and I am considering moving to PostgreSQL, as I am now running into performance issues with large tables. At the moment, I have the following general schema:

* Table 1: 80,000,000 rows - 9.5 GB
* Table 2: 1,000,000,000 rows - 8.9 GB

plus a few other miscellaneous tables (e.g. a Period table holding time values, which point to rows in Table 1 and Table 2). This is a generic star schema design for data warehousing.

Information is added to this warehouse (DW) every five minutes, and information is deleted from it every five minutes as well. The data being recorded is time sensitive: as data ages it may be deleted, and groups of samples are aggregated into a summary/aggregation sample prior to being deleted. I believe this is starting to cause fragmentation issues (please correct me if I am wrong), where old, deleted row positions get re-used by new rows on disk. This possibly causes problems when aggregation needs to take place (e.g. condensing the last six five-minute samples into a thirty-minute sample). Could this be causing overhead because the samples may be scattered around the disk instead of "being in a line", due to old, deleted row positions (further back in the table) now being used for new data? Is this a concern I should have?

The data in these tables could also be segmented/partitioned into their own individual tables, by roughly a factor of six. For example, five-minute samples are kept for about 48 hours and then expire, thirty-minute aggregation samples are kept for longer, and so on. I believe performance could be improved by separating these different sample types, for one.

At the moment with MySQL, DELETEing from the table (in batches of about 10,000 rows every few seconds, via a LIMIT clause on the DELETE query) removes rows very slowly, and INSERTs are sometimes very slow as well. I believe this is simply due to the sheer amount of data involved in this single table. (The expiry DELETE and the aggregation step are sketched below, with simplified names.)

I have read that it may be better to implement partitioning, where new data is added to a partition (e.g. one representing a single day), and when those samples expire the partition is simply dropped; roughly what I have in mind is also sketched below. I believe partitioning would solve the SELECT and INSERT performance issues because the actual index tree size (and the data in the table itself) would be reduced. Additionally, expired samples could simply have their owning table/partition dropped. As I see it, this would eliminate some of the performance issues with new rows being inserted, and aggregation of existing rows would only need to check a smaller subset of the overall table, since only a handful of partitions would be examined instead of all of them.

Can anyone provide any insight on whether they have experienced these performance issues before? Keep in mind these issues are on MySQL, and I am wondering whether moving to PostgreSQL may be able to assist me. Concurrent DELETEs and INSERTs every five minutes also hurt clients trying to retrieve statistics, as the tables run on MyISAM (table locks are needed for exclusive writes). From what I can tell, PostgreSQL does not require table locking for concurrent READ/INSERT/DELETE statements.

Should I be worried about such a large table? My main concern is that it is so slow because the index tree needs to be updated with each insert (and the tree is no doubt quite big with so many rows already). Will partitioning help here? I believe it would assist somewhat.
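For reference, the expiry job issues batched DELETEs along these lines (the table and column names here are simplified placeholders, not the real schema):

    DELETE FROM sample_5min
     WHERE sample_time < NOW() - INTERVAL 48 HOUR
     LIMIT 10000;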
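The aggregation step is essentially an INSERT ... SELECT that rolls six five-minute rows up into one thirty-minute row; something like the following, although the real schema goes through the Period table and the aggregate here (AVG, plus a flattened timestamp column) is only for illustration:

    INSERT INTO sample_30min (device_id, period_start, value)
    SELECT device_id,
           FROM_UNIXTIME(FLOOR(UNIX_TIMESTAMP(sample_time) / 1800) * 1800),
           AVG(value)
      FROM sample_5min
     WHERE sample_time >= NOW() - INTERVAL 30 MINUTE
     GROUP BY device_id, FLOOR(UNIX_TIMESTAMP(sample_time) / 1800);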
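The MySQL 5.1 layout I have in mind is roughly one partition per day, so that expiry becomes a DROP PARTITION instead of batched DELETEs (the dates and names below are only illustrative):

    CREATE TABLE sample_5min (
        device_id   INT NOT NULL,
        sample_time DATETIME NOT NULL,
        value       DOUBLE,
        KEY (device_id, sample_time)
    ) ENGINE=MyISAM
    PARTITION BY RANGE (TO_DAYS(sample_time)) (
        PARTITION p20090101 VALUES LESS THAN (TO_DAYS('2009-01-02')),
        PARTITION p20090102 VALUES LESS THAN (TO_DAYS('2009-01-03'))
    );

    -- new daily partitions are added ahead of time:
    ALTER TABLE sample_5min
        ADD PARTITION (PARTITION p20090103 VALUES LESS THAN (TO_DAYS('2009-01-04')));

    -- expiring a whole day of five-minute samples becomes a quick metadata operation:
    ALTER TABLE sample_5min DROP PARTITION p20090101;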
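On the PostgreSQL side, as far as I can tell the equivalent would be inheritance-based partitioning, roughly along these lines (again, the names and dates are only illustrative):

    CREATE TABLE sample_5min (
        device_id   integer   NOT NULL,
        sample_time timestamp NOT NULL,
        value       double precision
    );

    CREATE TABLE sample_5min_20090101 (
        CHECK (sample_time >= '2009-01-01' AND sample_time < '2009-01-02')
    ) INHERITS (sample_5min);

    CREATE INDEX sample_5min_20090101_idx
        ON sample_5min_20090101 (device_id, sample_time);

    -- INSERTs against the parent have to be redirected by hand, e.g. with a rule per child:
    CREATE RULE sample_5min_insert_20090101 AS
        ON INSERT TO sample_5min
        WHERE (sample_time >= '2009-01-01' AND sample_time < '2009-01-02')
        DO INSTEAD
        INSERT INTO sample_5min_20090101 VALUES (NEW.*);

    -- expiry is again just dropping the child table:
    DROP TABLE sample_5min_20090101;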
I am a bit hesitant, however, to go with PostgreSQL, because its partitioning (as sketched above) seems a bit less easy to work with than MySQL 5.1's implementation: I would need to maintain my own master table (and its child tables) for clients to query, and I do not think partition pruning is in use at this time (i.e., the planner being able to ignore partitions that won't contain any matching data for the query being issued).

Any comments or assistance will be appreciated.

Regards,
Andrew