Hello,
I am seeking information on best practices for data warehousing with MySQL. I am currently using MySQL as the database of choice, and I am considering moving to PostgreSQL, as I am now running into performance issues with large tables. At the moment, I have the following general schema:

* Table 1: 80,000,000 rows - 9.5 GB
* Table 2: 1,000,000,000 rows - 8.9 GB

plus a few other miscellaneous tables (e.g. a Period table holding time values, which point to rows in Table 1 and Table 2). This is a generic star schema design for data warehousing.

Information is added to this warehouse (DW) every five minutes, and information is deleted from it every five minutes as well. The data being recorded is time sensitive: as data ages it may be deleted, and groups of samples are aggregated into a summary/aggregation sample prior to being deleted. I believe this is starting to cause fragmentation issues (please correct me if I am wrong), where old, deleted row positions get re-used by new rows on disk. This possibly causes problems when aggregation needs to take place (e.g. condensing the last six five-minute samples into a thirty-minute sample). Could this be causing overhead because the samples may be scattered around the disk instead of "being in a line", due to old, deleted row positions (further back in the table) now being used for new data? Is this a concern I should have?

The data in these tables could also be segmented/partitioned into their own individual tables, by roughly a factor of six. For example, five-minute samples are kept for about 48 hours and then expire, thirty-minute aggregation samples are kept for longer, and so on. I believe performance could be improved by separating these different sample types, for one.

At the moment with MySQL, DELETEing from the table (in batches of about 10,000 rows every few seconds, via a LIMIT clause on the DELETE query) removes rows very slowly, and INSERTs are sometimes very slow as well. I believe this is simply due to the sheer amount of data involved in this single table. (The expiry DELETE and the aggregation step are sketched below, with simplified names.)

I have read that it may be better to implement partitioning, where new data is added to a partition (e.g. one representing a single day), and when those samples expire the partition is simply dropped; roughly what I have in mind is also sketched below. I believe partitioning would solve the SELECT and INSERT performance issues because the actual index tree size (and the data in the table itself) would be reduced. Additionally, expired samples could simply have their owning table/partition dropped. As I see it, this would eliminate some of the performance issues with new rows being inserted, and aggregation of existing rows would only need to check a smaller subset of the overall table, since only a handful of partitions would be examined instead of all of them.

Can anyone provide any insight on whether they have experienced these performance issues before? Keep in mind these issues are on MySQL, and I am wondering whether moving to PostgreSQL may be able to assist me. Concurrent DELETEs and INSERTs every five minutes also hurt clients trying to retrieve statistics, as the tables run on MyISAM (table locks are needed for exclusive writes). From what I can tell, PostgreSQL does not require table locking for concurrent READ/INSERT/DELETE statements.

Should I be worried about such a large table? My main concern is that it is so slow because the index tree needs to be updated with each insert (and the tree is no doubt quite big with so many rows already). Will partitioning help here? I believe it would assist somewhat.
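For reference, the expiry job issues batched DELETEs along these lines (the table and column names here are simplified placeholders, not the real schema):

    DELETE FROM sample_5min
     WHERE sample_time < NOW() - INTERVAL 48 HOUR
     LIMIT 10000;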
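The aggregation step is essentially an INSERT ... SELECT that rolls six five-minute rows up into one thirty-minute row; something like the following, although the real schema goes through the Period table and the aggregate here (AVG, plus a flattened timestamp column) is only for illustration:

    INSERT INTO sample_30min (device_id, period_start, value)
    SELECT device_id,
           FROM_UNIXTIME(FLOOR(UNIX_TIMESTAMP(sample_time) / 1800) * 1800),
           AVG(value)
      FROM sample_5min
     WHERE sample_time >= NOW() - INTERVAL 30 MINUTE
     GROUP BY device_id, FLOOR(UNIX_TIMESTAMP(sample_time) / 1800);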
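The MySQL 5.1 layout I have in mind is roughly one partition per day, so that expiry becomes a DROP PARTITION instead of batched DELETEs (the dates and names below are only illustrative):

    CREATE TABLE sample_5min (
        device_id   INT NOT NULL,
        sample_time DATETIME NOT NULL,
        value       DOUBLE,
        KEY (device_id, sample_time)
    ) ENGINE=MyISAM
    PARTITION BY RANGE (TO_DAYS(sample_time)) (
        PARTITION p20090101 VALUES LESS THAN (TO_DAYS('2009-01-02')),
        PARTITION p20090102 VALUES LESS THAN (TO_DAYS('2009-01-03'))
    );

    -- new daily partitions are added ahead of time:
    ALTER TABLE sample_5min
        ADD PARTITION (PARTITION p20090103 VALUES LESS THAN (TO_DAYS('2009-01-04')));

    -- expiring a whole day of five-minute samples becomes a quick metadata operation:
    ALTER TABLE sample_5min DROP PARTITION p20090101;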
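On the PostgreSQL side, as far as I can tell the equivalent would be inheritance-based partitioning, roughly along these lines (again, the names and dates are only illustrative):

    CREATE TABLE sample_5min (
        device_id   integer   NOT NULL,
        sample_time timestamp NOT NULL,
        value       double precision
    );

    CREATE TABLE sample_5min_20090101 (
        CHECK (sample_time >= '2009-01-01' AND sample_time < '2009-01-02')
    ) INHERITS (sample_5min);

    CREATE INDEX sample_5min_20090101_idx
        ON sample_5min_20090101 (device_id, sample_time);

    -- INSERTs against the parent have to be redirected by hand, e.g. with a rule per child:
    CREATE RULE sample_5min_insert_20090101 AS
        ON INSERT TO sample_5min
        WHERE (sample_time >= '2009-01-01' AND sample_time < '2009-01-02')
        DO INSTEAD
        INSERT INTO sample_5min_20090101 VALUES (NEW.*);

    -- expiry is again just dropping the child table:
    DROP TABLE sample_5min_20090101;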
I am a bit hesitant, however, to go with PostgreSQL, because its partitioning (as sketched above) seems a bit less easy to work with than MySQL 5.1's implementation: I would need to maintain my own master table (and its child tables) for clients to query, and I do not think partition pruning is in use at this time (i.e., the planner being able to ignore partitions that won't contain any matching data for the query being issued).

Any comments or assistance will be appreciated.

Regards,
Andrew